阿布云

Scraping Information on a Million Zhihu Users: The Redis Part

Posted by 阿布云

<div id="cnblogs_post_body"><p><a href="https://github.com/wangqifan/"><img src="http://images2015.cnblogs.com/blog/814953/201702/814953-20170227113456829-1370229104.png" alt="" width="50" height="50">Click here to view the source code on GitHub</a>&nbsp;&nbsp; Don't forget to star it.</p>
<p>GitHub repository for this project: https://github.com/wangqifan/ZhiHu</p>
<p>Installing Redis</p>
<p>Redis has no official Windows release: the maintainers consider Linux sufficient and feel that a Windows version would slow development down. Fortunately, a team at Microsoft maintains a Windows port of Redis. Many blog posts cover installing Redis, and most of them walk through a series of command lines. There is also an MSI build of Redis that installs like any ordinary program, by clicking Next until it finishes. Download: https://github.com/MSOpenTech/redis/releases/download/win-3.2.100/Redis-x64-3.2.100.msi</p>
<p>Redis Configuration</p>
<p>A detailed explanation of the Redis configuration file: http://www.cnblogs.com/kreo/p/4423362.html</p>
<p>Locate the configuration file redis.windows-service.conf</p>
<p>&nbsp;</p>
<p><img src="http://images2015.cnblogs.com/blog/814953/201701/814953-20170108112435581-179605722.png" alt=""></p>
<p>Two settings need attention here. 1. Remote connections</p>
<div class="cnblogs_code">
<pre><span style="color: #000000">#
# ~~~ WARNING ~~~ If the computer running Redis is directly exposed to the
# internet, binding to all the interfaces is dangerous and will expose the
# instance to everybody on the internet. So by default we uncomment the
# following bind directive, that will force Redis to listen only into
# the IPv4 lookback interface address (this means Redis will be able to
# accept connections only from clients running into the same computer it
# is running).
#
# IF YOU ARE SURE YOU WANT YOUR INSTANCE TO LISTEN TO ALL THE INTERFACES
# JUST COMMENT THE FOLLOWING LINE.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bind 0.0.0.0
# Change "bind 127.0.0.1" to "bind 0.0.0.0" so that Redis accepts remote connections.</span></pre>
</div>
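<p>After changing the bind address, it can be worth confirming from another machine that the server really is reachable. The following is a minimal sketch, not part of the original project, that uses the ServiceStack.Redis client introduced later in this post. The host address and the class name RemoteCheck are placeholders.</p>
<div class="cnblogs_code">
<pre>using System;
using ServiceStack.Redis;

public static class RemoteCheck
{
    public static void Main()
    {
        // Placeholder: replace with the address of the machine running Redis.
        string host = "192.168.1.10";
        using (RedisClient redis = new RedisClient(host, 6379))
        {
            redis.ConnectTimeout = 2000;
            // Ping sends the PING command and returns true when the server answers PONG.
            Console.WriteLine(redis.Ping() ? "Redis is reachable" : "No PONG received");
        }
    }
}</pre>
</div>
<p>If the server is still bound to the loopback address or a firewall blocks port 6379, the call is expected to fail with an exception rather than return false.</p>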
<p>2. Memory limit</p>
<div class="cnblogs_code">
<pre><span style="color: #000000"># NOTE: since Redis uses the system paging file to allocate the heap memory,
# the Working Set memory usage showed by the Windows Task Manager or by other
# tools such as ProcessExplorer will not always be accurate. For example, right
# after a background save of the RDB or the AOF files, the working set value
# may drop significantly. In order to check the correct amount of memory used
# by the redis-server to store the data, use the INFO client command. The INFO
# command shows only the memory used to store the redis data, not the extra
# memory used by the Windows process for its own requirements. The extra amount
# of memory not reported by the INFO command can be calculated subtracting the
# Peak Working Set reported by the Windows Task Manager and the used_memory_peak
# reported by the INFO command.
#
maxmemory 2000mb
# The maximum memory can be changed here. A larger value is advisable, since Redis is fairly memory-hungry.</span></pre>
</div>
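<p>The configuration comment above recommends the INFO command for checking how much memory the data actually uses. A minimal sketch of reading it from C# follows. It assumes the ServiceStack.Redis client exposes the INFO reply through its Info dictionary property. The host and the class name MemoryCheck are placeholders.</p>
<div class="cnblogs_code">
<pre>using System;
using System.Collections.Generic;
using ServiceStack.Redis;

public static class MemoryCheck
{
    public static void Main()
    {
        // Placeholder host: point this at the Redis machine to inspect.
        using (RedisClient redis = new RedisClient("127.0.0.1", 6379))
        {
            // Assumption: Info holds the key/value pairs of the INFO reply.
            Dictionary&lt;string, string&gt; info = redis.Info;
            string used;
            if (info.TryGetValue("used_memory", out used))
                Console.WriteLine("used_memory (bytes): " + used);
        }
    }
}</pre>
</div>
<p>Comparing this figure with the maxmemory setting gives a rough idea of how close a machine is to its limit.</p>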
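<p>Wrapping the Redis connection in a class</p>
<p>The C# Redis driver ServiceStack.Redis is installed through NuGet. The library has been commercialized, and from version 4.0 onward it enforces a quota of no more than 6000 requests per hour, so installing version 3.9 is recommended.</p>
<p>In this crawler system I started with Redis on a single computer, but that machine became very sluggish. I then moved to three computers running Redis: one for the hash table, one for the UrlNext and Urltoken queues, and one for the User queue. The lab computers are very old, so things were still slow. In the end I added two more machines: three lab computers hold the hash table, my own computer holds the User queue, and a younger classmate's computer was commandeered for the task queues.</p>
<p>The class is named RedisCore.</p>
<p>IP address list</p>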
<div class="cnblogs_code">
<pre><span style="color: #0000ff">public</span> <span style="color: #0000ff">static</span> List&lt;<span style="color: #0000ff">string</span>&gt; ips = <span style="color: #0000ff">new</span> List&lt;<span style="color: #0000ff">string</span>&gt;<span style="color: #000000">()
        {
            </span><span style="color: #800000">"</span><span style="color: #800000">59.74.169.54</span><span style="color: #800000">"</span><span style="color: #000000">,
            </span><span style="color: #800000">"</span><span style="color: #800000">59.74.169.57</span><span style="color: #800000">"</span><span style="color: #000000">,
            </span><span style="color: #800000">"</span><span style="color: #800000">59.74.169.52</span><span style="color: #800000">"</span><span style="color: #000000">,
            </span><span style="color: #800000">"</span><span style="color: #800000">59.74.169.58</span><span style="color: #800000">"</span><span style="color: #000000">,
            </span><span style="color: #800000">"</span><span style="color: #800000">59.74.169.39</span><span style="color: #800000">"</span><span style="color: #000000">
        };</span></pre>
</div>
<p>&nbsp;</p>
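<p>Wrapping the queue operations.</p>
<p>A Redis queue is built on the list data structure: pushing on the right and popping from the left gives first-in, first-out behavior.</p>
<p>Push</p>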
<div class="cnblogs_code">
<pre><span style="color: #0000ff">public</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">bool</span> PushIntoList(<span style="color: #0000ff">int</span> type, <span style="color: #0000ff">string</span> key, <span style="color: #0000ff">string</span><span style="color: #000000"> value)
        {
            </span><span style="color: #0000ff">bool</span> Result = <span style="color: #0000ff">false</span><span style="color: #000000">;
            </span><span style="color: #0000ff">using</span> (RedisClient Redis = <span style="color: #0000ff">new</span> RedisClient(ips[type - <span style="color: #800080">1</span>], <span style="color: #800080">6379</span><span style="color: #000000">))
            {
                Redis.ConnectTimeout </span>= <span style="color: #800080">2000</span><span style="color: #000000">;
                Result </span>= Redis.RPush(key, Encoding.UTF8.GetBytes(value)) &gt; <span style="color: #800080">0</span><span style="color: #000000">;
            }
            </span><span style="color: #0000ff">return</span><span style="color: #000000"> Result;
        }</span></pre>
</div>
<p>&nbsp;</p>
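<p>Note that the client holds an unmanaged resource that must be released explicitly; the using block takes care of that.</p>
<p>Pop</p>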
<div class="cnblogs_code">
<pre><span style="color: #0000ff">public</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">string</span> PopFromList(<span style="color: #0000ff">int</span> type, <span style="color: #0000ff">string</span><span style="color: #000000"> key)
        {
            </span><span style="color: #0000ff">string</span> result = <span style="color: #0000ff">string</span><span style="color: #000000">.Empty;
            </span><span style="color: #0000ff">try</span><span style="color: #000000">
            {
                </span><span style="color: #0000ff">using</span> (RedisClient Redis = <span style="color: #0000ff">new</span> RedisClient(ips[type - <span style="color: #800080">1</span>], <span style="color: #800080">6379</span><span style="color: #000000">))
                {
                    Redis.ConnectTimeout </span>= <span style="color: #800080">2000</span><span style="color: #000000">;
                    </span><span style="color: #008000">// LPop returns null when the list is empty, so check before decoding.</span><span style="color: #000000">
                    </span><span style="color: #0000ff">byte</span>[] data =<span style="color: #000000"> Redis.LPop(key);
                    </span><span style="color: #0000ff">if</span> (data != <span style="color: #0000ff">null</span><span style="color: #000000">)
                        result </span>=<span style="color: #000000"> Encoding.UTF8.GetString(data);
                }
            }
            </span><span style="color: #0000ff">catch</span><span style="color: #000000">
            {
                </span><span style="color: #008000">// Connection errors are ignored; the caller receives an empty string.</span><span style="color: #000000">
            }
            </span><span style="color: #0000ff">return</span><span style="color: #000000"> result;
        }</span></pre>
</div>
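<p>To show how the two wrappers fit together, here is a minimal usage sketch. It assumes the RedisCore class above is available, that machine 1 hosts the task queues (the post does not say which index that is), and that an empty string from PopFromList means the queue is currently empty. The URL and the class name QueueDemo are made up for illustration.</p>
<div class="cnblogs_code">
<pre>using System;
using System.Threading;

public static class QueueDemo
{
    // Assumption: machine 1 hosts the task queues.
    const int TaskQueueMachine = 1;

    public static void Main()
    {
        // Producer side: append a hypothetical URL to the right end of the list.
        RedisCore.PushIntoList(TaskQueueMachine, "UrlNext", "https://www.zhihu.com/people/example");

        // Consumer side: take items from the left end, waiting briefly when nothing is there.
        for (int attempt = 0; attempt &lt; 3; attempt++)
        {
            string url = RedisCore.PopFromList(TaskQueueMachine, "UrlNext");
            if (string.IsNullOrEmpty(url))
            {
                Thread.Sleep(1000);
                continue;
            }
            Console.WriteLine("Next URL to crawl: " + url);
        }
    }
}</pre>
</div>
<p>Because every crawler process pushes and pops against the same list, the work is shared without any further coordination between processes.</p>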
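<p>The hash table is spread across three computers. To decide which one stores a given key, hash the key, take the absolute value, and take the remainder modulo 3: a remainder of 0 goes to machine 3, 1 goes to machine 4, and 2 goes to machine 5.</p>
<p>If the key already exists in the hash table, the insert fails and returns false; if it does not exist, the insert succeeds and returns true.</p>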
<div class="cnblogs_code">
<pre> <span style="color: #0000ff">public</span> <span style="color: #0000ff">static</span> <span style="color: #0000ff">bool</span> InsetIntoHash(<span style="color: #0000ff">int</span> type, <span style="color: #0000ff">string</span> hashid, <span style="color: #0000ff">string</span> key, <span style="color: #0000ff">string</span><span style="color: #000000"> value)
        {
            </span><span style="color: #0000ff">bool</span> result = <span style="color: #0000ff">false</span><span style="color: #000000">;
            </span><span style="color: #0000ff">try</span><span style="color: #000000">
            {
                </span><span style="color: #0000ff">using</span> (RedisClient Redis = <span style="color: #0000ff">new</span> RedisClient(ips[type - <span style="color: #800080">1</span>], <span style="color: #800080">6379</span><span style="color: #000000">))
                {
                    Redis.ConnectTimeout </span>= <span style="color: #800080">2000</span><span style="color: #000000">;
                    result </span>=<span style="color: #000000"> Redis.SetEntryInHashIfNotExists(hashid, key, value);
                }
            }
            </span><span style="color: #0000ff">catch</span><span style="color: #000000"> { }

            </span><span style="color: #0000ff">return</span><span style="color: #000000"> result;
        }
      </span></pre>
</div>
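<p>The post does not show how the machine number is derived from the key, so the following is only a sketch of the rule described above. It swaps in the 32-bit FNV-1a hash, an assumption on my part, because string.GetHashCode is not guaranteed to give the same value in every process, and every crawler must agree on where a key lives. Using an unsigned hash also removes the need for the absolute-value step. The class ShardSelector and the method MachineForKey are hypothetical names.</p>
<div class="cnblogs_code">
<pre>using System;
using System.Text;

public static class ShardSelector
{
    // Hypothetical helper: maps a key to machine 3, 4 or 5.
    public static int MachineForKey(string key)
    {
        // FNV-1a (32-bit) is an assumption; the post does not name the hash function.
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash ^= b;
            hash *= 16777619; // wraps around on overflow
        }
        // Remainder 0, 1, 2 maps to machines 3, 4, 5.
        return (int)(hash % 3) + 3;
    }
}</pre>
</div>
<p>A caller could then write RedisCore.InsetIntoHash(ShardSelector.MachineForKey(key), hashid, key, value), so that the same key always reaches the same machine and a duplicate is reported as false.</p>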
<p>&nbsp;</p></div>

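Anti-Crawler Issues

   When requests are sent too frequently, the server responds with 429. At that point the crawler needs to switch to a different proxy IP. 阿布云 proxies are recommended; the service provides high-anonymity proxies and crawler proxies.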