阿布云

Scraping One Million Zhihu Users: The Crawler Module

Posted by 阿布云

The full source code is available on GitHub. Don't forget to star the repository.

 

public class UserManage
{
    private string html;       // raw HTML of the user's profile page
    private string url_token;  // the token that identifies the user in Zhihu URLs
}

Constructor

 

A user's profile page URL has the form "https://www.zhihu.com/people/" + url_token + "/following".

public UserManage(string urltoken)
{
    url_token = urltoken;
}

First, wrap the page download in a method that fetches the HTML.

 

private bool GetHtml()
{
    string url = "https://www.zhihu.com/people/" + url_token + "/following";
    html = HttpHelp.DownLoadString(url);
    return !string.IsNullOrEmpty(html);
}

With the HTML page in hand, the next step is to extract the JSON embedded in it, using HtmlAgilityPack.

public void analyse()
{
    if (GetHtml())
    {
        try
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            // The JSON sits in the data-state attribute of the element with id "data".
            HtmlNode node = doc.GetElementbyId("data");
            StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
            // Undo the HTML entity escaping so the value can be parsed as JSON.
            stringbuilder.Replace("&quot;", "'");
            stringbuilder.Replace("&lt;", "<");
            stringbuilder.Replace("&gt;", ">");
            watch.Stop();
            Console.WriteLine("Analysing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}
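The entity replacement in analyse turns &quot; into a single quote, which depends on the JSON parser accepting single-quoted strings and fails if a value itself contains an apostrophe. As a minimal sketch of an alternative, the snippet below decodes the attribute with the standard WebUtility.HtmlDecode and parses the result. The class and method names are hypothetical, and System.Text.Json is used here only to keep the example free of external packages; the article itself uses Newtonsoft.Json.

```csharp
using System;
using System.Net;
using System.Text.Json;

// Hypothetical helper, not part of the original project.
public static class DataStateDecoder
{
    // Decodes every HTML entity (&quot;, &lt;, &gt;, &amp;, ...) in a raw
    // attribute value, leaving ordinary double-quoted JSON.
    public static string Decode(string rawAttributeValue)
    {
        return WebUtility.HtmlDecode(rawAttributeValue);
    }

    // Decodes the attribute value and reads one top-level string property.
    public static string ReadString(string rawAttributeValue, string propertyName)
    {
        using (JsonDocument doc = JsonDocument.Parse(Decode(rawAttributeValue)))
        {
            return doc.RootElement.GetProperty(propertyName).GetString();
        }
    }
}
```

Because the double quotes are restored rather than swapped for single quotes, an apostrophe inside a value no longer terminates the string early.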

Queue the links to the user's follower and following lists

private void GetUserFlowerandNext(string json)
{
    // People who follow this user.
    string followers = "https://www.zhihu.com/api/v4/members/" + url_token + "/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";
    // People this user follows.
    string following = "https://www.zhihu.com/api/v4/members/" + url_token + "/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0";
    RedisCore.PushIntoList(1, "nexturl", following);
    RedisCore.PushIntoList(1, "nexturl", followers);
}
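The two URLs carry the same include parameter, only percent-encoded slightly differently. As a rough sketch, the parameter can be kept in readable form and encoded in one place with Uri.EscapeDataString. FollowUrlBuilder and Build are hypothetical names, and whether the API still accepts these parameters is an assumption taken from the article.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class FollowUrlBuilder
{
    // Decoded form of the "include" parameter used by both URLs in the article.
    private const string Include =
        "data[*].answer_count,articles_count,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics";

    // kind is "followers" or "followees", matching the two endpoints above.
    public static string Build(string urlToken, string kind, int offset = 0, int limit = 20)
    {
        return "https://www.zhihu.com/api/v4/members/" + Uri.EscapeDataString(urlToken)
             + "/" + kind
             + "?include=" + Uri.EscapeDataString(Include)
             + "&offset=" + offset
             + "&limit=" + limit;
    }
}
```

Calling FollowUrlBuilder.Build(url_token, "followees") and FollowUrlBuilder.Build(url_token, "followers") would then replace the two hand-written string literals.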

 

Next, narrow the JSON down to just the user's information, using the JSON library Newtonsoft.Json.

private void GetUserInformation(string json)
{
    JObject obj = JObject.Parse(json);
    // The user's record lives at entities.users[url_token].
    string xpath = "['" + url_token + "']";
    JToken tocken = obj.SelectToken("['entities']").SelectToken("['users']").SelectToken(xpath);
    RedisCore.PushIntoList(2, "User", tocken.ToString());
}
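SelectToken returns null when a path segment is missing, so the chained calls in GetUserInformation throw a NullReferenceException on an unexpected page. A minimal sketch of a null-tolerant lookup, using Json.NET's indexers instead of a path string, might look like this. UserJson and ExtractUser are hypothetical names, and the snippet needs the Newtonsoft.Json package.

```csharp
using Newtonsoft.Json.Linq;

// Hypothetical helper, not part of the original project.
public static class UserJson
{
    // Returns the JSON text of entities.users[urlToken],
    // or null if any level of that path is missing.
    public static string ExtractUser(string json, string urlToken)
    {
        JObject obj = JObject.Parse(json);
        JToken user = obj["entities"]?["users"]?[urlToken];
        return user?.ToString();
    }
}
```

A caller can then skip the Redis push when the result is null. Indexing by key also avoids building a path string out of url_token.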

 

 

Now complete the analyse function:

public void analyse()
{
    if (GetHtml())
    {
        try
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode node = doc.GetElementbyId("data");
            StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
            // Undo the HTML entity escaping so the value can be parsed as JSON.
            stringbuilder.Replace("&quot;", "'");
            stringbuilder.Replace("&lt;", "<");
            stringbuilder.Replace("&gt;", ">");
            // Store the user's record, then queue the user's follow lists.
            GetUserInformation(stringbuilder.ToString());
            GetUserFlowerandNext(stringbuilder.ToString());
            watch.Stop();
            Console.WriteLine("Analysing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}

  

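UrlTask takes a follow-list URL from the nexturl queue and fetches that list. The server responds with JSON data.

Wrap the serialization and deserialization of objects in a helper class: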

public class SerializeHelper
{
    /// <summary>
    /// Serializes an object to a JSON string.
    /// </summary>
    /// <param name="value">The object to serialize.</param>
    /// <returns>The JSON string.</returns>
    public static string SerializeToString(object value)
    {
        return JsonConvert.SerializeObject(value);
    }

    /// <summary>
    /// Deserializes a JSON string into an object.
    /// </summary>
    /// <typeparam name="T">The target type.</typeparam>
    /// <param name="str">The JSON string.</param>
    /// <returns>The deserialized object.</returns>
    public static T DeserializeToObject<T>(string str)
    {
        return JsonConvert.DeserializeObject<T>(str);
    }
}

The UrlTask class

public class UrlTask
{
    private string url { get; set; }
    private string JSONstring { get; set; }

    public UrlTask(string _url)
    {
        url = _url;
    }
}

Add a method that fetches the resource:

private bool GetHtml()
{
    JSONstring = HttpHelp.DownLoadString(url);
    Console.WriteLine("JSON download finished");
    return !string.IsNullOrEmpty(JSONstring);
}

The method that parses the JSON:

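public void Analyse()
{
    try
    {
        if (GetHtml())
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            followerResult result = SerializeHelper.DeserializeToObject<followerResult>(JSONstring);
            // Queue the next page unless this is the last one.
            if (!result.paging.is_end)
            {
                RedisCore.PushIntoList(1, "nexturl", result.paging.next);
            }
            foreach (var item in result.data)
            {
                // Pick Redis database 3, 4 or 5 from the token's hash. The original hashed the
                // item object itself, which only gives equal values for equal tokens if the
                // class overrides GetHashCode; masking the sign bit also avoids the overflow
                // that Math.Abs throws for int.MinValue.
                int type = (item.url_token.GetHashCode() & int.MaxValue) % 3 + 3;
                // "存在" means "exists"; only tokens not seen before are queued.
                if (RedisCore.InsetIntoHash(type, "urltokenhash", item.url_token, "存在"))
                {
                    RedisCore.PushIntoList(1, "urltoken", item.url_token);
                }
            }
            watch.Stop();
            Console.WriteLine("Parsing the JSON took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}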
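Analyse spreads the deduplication hash over Redis databases 3 to 5 by hash code. One caveat: on newer .NET runtimes string.GetHashCode is randomized per process, so a restarted crawler could look for a token in a different database than the one it was stored in. A minimal sketch of a deterministic alternative follows, using an FNV-1a-style hash over the token's characters. TokenShard and DatabaseFor are hypothetical names; the database range 3 to 5 comes from the article's code.

```csharp
// Hypothetical helper, not part of the original project.
public static class TokenShard
{
    // FNV-1a-style hash over the token's UTF-16 code units. The result depends
    // only on the characters, so it is the same in every process and on every runtime.
    public static int DatabaseFor(string urlToken)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in urlToken)
            {
                hash ^= c;
                hash *= 16777619;
            }
            // Map the hash onto databases 3, 4 and 5.
            return (int)(hash % 3) + 3;
        }
    }
}
```

In Analyse, the line computing type would then become int type = TokenShard.DatabaseFor(item.url_token);.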

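Explanation: if result.paging.is_end is true, this is the last page of the user's follow list and there is no next page to queue; otherwise, result.paging.next is pushed onto the nexturl queue. The entries in the user array that follows are incomplete, so they are discarded. Only the url_token is kept, and the full details are fetched later from the user's profile page.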
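The paging and deduplication logic can be tried out without Redis. The sketch below uses in-memory collections as stand-ins for the nexturl list, the urltoken list, and the urltokenhash hash. All class names are hypothetical; the field names mirror the ones Analyse reads (paging.is_end, paging.next, data, url_token), and the real followerResult class is not shown in the article.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the deserialized response.
public class Paging { public bool is_end; public string next; }
public class Follower { public string url_token; }
public class FollowerPage { public Paging paging; public List<Follower> data; }

public class FrontierSketch
{
    public readonly Queue<string> NextUrls = new Queue<string>();   // stands in for the "nexturl" list
    public readonly Queue<string> UrlTokens = new Queue<string>();  // stands in for the "urltoken" list
    private readonly HashSet<string> seen = new HashSet<string>();  // stands in for "urltokenhash"

    public void Handle(FollowerPage page)
    {
        // Queue the next page only when this is not the last one.
        if (!page.paging.is_end)
        {
            NextUrls.Enqueue(page.paging.next);
        }
        foreach (Follower item in page.data)
        {
            // HashSet.Add returns false for a token seen before,
            // so each user is queued at most once.
            if (seen.Add(item.url_token))
            {
                UrlTokens.Enqueue(item.url_token);
            }
        }
    }
}
```

Feeding it a page that lists the same user twice queues that user once, and a page with is_end set to true adds nothing to NextUrls.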

Putting the modules together

Wrap a method that takes a nexturl from the queue, visits that follow list, and collects more user IDs:

private static void GetNexturl()
{
    string nexturl = RedisCore.PopFromList(1, "nexturl");
    if (!string.IsNullOrEmpty(nexturl))
    {
        UrlTask task = new UrlTask(nexturl);
        task.Analyse();
    }
}

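Wrap a method that loops, taking a user's urltoken from the queue (and calling GetNexturl when that queue is empty), visiting the user's profile page, and collecting the information: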

private static void GetUser(object data)
{
    while (true)
    {
        string url_token = RedisCore.PopFromList(1, "urltoken");
        Console.WriteLine(url_token);
        if (!string.IsNullOrEmpty(url_token))
        {
            UserManage manage = new UserManage(url_token);
            manage.analyse();
        }
        else
        {
            GetNexturl();
        }
    }
}
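When both queues are empty, the loop in GetUser polls Redis as fast as it can. One possible refinement, sketched here under the assumption that a short pause is acceptable, is to sleep after an empty round and stop after several empty rounds in a row. WorkerLoop, Run, and the tryWork delegate are hypothetical names; tryWork stands for one round of "pop a urltoken or a nexturl and process it".

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the original project.
public static class WorkerLoop
{
    // tryWork returns true if it found and processed an item,
    // false if there was nothing to do.
    // Returns the number of items processed before giving up.
    public static int Run(Func<bool> tryWork, int maxIdleRounds, int idleDelayMs)
    {
        int processed = 0;
        int idleRounds = 0;
        while (idleRounds < maxIdleRounds)
        {
            if (tryWork())
            {
                processed++;
                idleRounds = 0;              // work was found, reset the idle counter
            }
            else
            {
                idleRounds++;
                Thread.Sleep(idleDelayMs);   // pause instead of polling again immediately
            }
        }
        return processed;
    }
}
```

The idle limit and delay are tuning knobs; a crawler meant to run indefinitely could pass int.MaxValue as the limit and keep only the pause.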

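Run these methods from the main function. Because the workload is large, use multiple threads; choose the number of threads to suit your situation.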

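// Thread pool threads are background threads, so Main has to stay alive
// (for example by blocking on Console.ReadLine) while the workers run.
for (int i = 0; i < 10; i++)
{
    ThreadPool.QueueUserWorkItem(GetUser);
}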

 

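Adding seed data. At the very start every queue is empty, so seed data has to be added in one of two ways:

  1. Manually, by typing commands into redis-cli.exe
  2. By adding the following to the main function:

UserManage manage = new UserManage("<url_token of some user>");
manage.analyse();

After running it once, comment it out to avoid adding duplicates.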

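Anti-crawling measures

If you crawl too frequently, the server responds with HTTP 429 (Too Many Requests). At that point you need to switch to a different proxy IP. The author recommends 阿布云 proxies, which offer high-anonymity proxies and crawler proxies.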
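As a rough illustration of switching proxies on a 429, the sketch below rotates through a list of proxy addresses and builds an HttpClient for the current one. It is not specific to any proxy provider: ProxyRotator and its members are hypothetical names, the addresses passed in are placeholders, and the article's own HttpHelp class is not shown, so how this would plug into it is left open.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not part of the original project. Not thread-safe as written.
public class ProxyRotator
{
    private readonly string[] proxies;
    private int index;

    public ProxyRotator(params string[] proxyAddresses)
    {
        if (proxyAddresses.Length == 0)
        {
            throw new ArgumentException("at least one proxy address is required");
        }
        proxies = proxyAddresses;
    }

    public string Current { get { return proxies[index]; } }

    // HTTP 429 (Too Many Requests) is the signal described above.
    public static bool IsRateLimited(HttpStatusCode status)
    {
        return (int)status == 429;
    }

    // Moves to the next proxy, wrapping around at the end of the list.
    public string Rotate()
    {
        index = (index + 1) % proxies.Length;
        return Current;
    }

    // Builds a client that sends its requests through the current proxy.
    public HttpClient CreateClient()
    {
        var handler = new HttpClientHandler { Proxy = new WebProxy(Current), UseProxy = true };
        return new HttpClient(handler);
    }
}
```

A download routine would check IsRateLimited on each response and, when it returns true, call Rotate, create a new client, and retry the request.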