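What you need is more than just a good proxy.
Open-source repository: http://git.oschina.net/dreamidea/neocrawler
I. Overview
NEOCrawler (Chinese name: 牛咖) is a crawler system built on nodejs, redis, and phantomjs. The code is fully open source and is well suited to data collection in vertical domains and to secondary development of crawlers.
Configurable items: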
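1). URL classes are described with regular expressions; similar pages are grouped into one class and share the same rules. (The items below are all configurable per URL class.)
2). The start URLs, fetch method, storage location, page-processing method, and so on;
3). The rules for links to collect: CSS selectors restrict the crawler to links that appear in a particular part of the page;
4). Page extraction rules: CSS selectors and regular expressions locate where each field's content is extracted from;
5). Predefined js statements to inject and run after the page opens;
6). Preset cookies for the page;
7). Rules for judging whether a page of this class was returned normally, usually a set of keywords that must be present in a normal response, which the crawler checks for;
8). Rules for judging whether extraction is complete: a few essential fields chosen from the extracted fields serve as the completeness criterion;
9). The scheduling weight (priority) and period (how long before re-crawling for updates) of this class of pages.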
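The idea in item 1, grouping URLs into classes by regular expression, can be sketched roughly as follows. This is only an illustration in Python, not NEOCrawler's own code: the `URL_CLASSES` table, the `classify` helper, and the example.com patterns are all hypothetical.

```python
import re

# Hypothetical URL classes: (alias, regular expression). The patterns are
# illustrative only and are not taken from NEOCrawler's own configuration.
URL_CLASSES = [
    ("detail", r"^http://example\.com/item_\d+\.html$"),
    ("list", r"^http://example\.com/list_\d+\.html$"),
    ("home", r"^http://example\.com/?$"),
]


def classify(url):
    """Return the alias of the first class whose pattern matches, else None."""
    for alias, pattern in URL_CLASSES:
        if re.search(pattern, url):
            return alias
    return None


print(classify("http://example.com/item_42.html"))  # detail
print(classify("http://example.com/about"))         # None
```

Every URL that lands in the same class would then share one set of rules (extraction, cookies, scheduling), which is why narrowly scoped patterns matter.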
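Tip: if you are new to this system, we recommend skipping the architecture introduction and going straight to Part II to get the system running first. Once you have a feel for it, come back to the architecture section. If you need to do in-depth secondary development, read this section carefully.
Overall architecture
The yellow parts of the figure are the subsystems of the crawler.
II. Running the system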
Create the HBase tables in the hbase shell:
create 'crawled', {NAME => 'basic', VERSIONS => 3}, {NAME => "data", VERSIONS => 3}, {NAME => 'extra', VERSIONS => 3}
create 'crawled_bin', {NAME => 'basic', VERSIONS => 3}, {NAME => "binary", VERSIONS => 3}
The hbase rest interface is recommended. After starting hbase, run the following command in the bin subdirectory of the hbase directory to start hbase rest:
./hbase-daemon.sh start rest
The default port is 8080; it is used in the configuration below.
{
    /* Note: the comments here only explain each setting; the real setting.json must not contain comments */
    "driller_info_redis_db": ["127.0.0.1", 6379, 0], /* where the URL-rule configuration is stored; the last number is the redis database index */
    "url_info_redis_db": ["127.0.0.1", 6379, 1], /* where URL information is stored */
    "url_report_redis_db": ["127.0.0.1", 6380, 2], /* where crawl error information is stored */
    "proxy_info_redis_db": ["127.0.0.1", 6379, 3], /* where http proxy addresses are stored */
    "use_proxy": false, /* whether to use the proxy service */
    "proxy_router": "127.0.0.1:2013", /* address of the proxy service's routing center, when the proxy service is used */
    "download_timeout": 60, /* download timeout in seconds; not the same as the response timeout */
    "save_content_to_hbase": false, /* whether to store crawled data in hbase; so far tested only on 0.94 */
    "crawled_hbase_conf": ["localhost", 8080], /* hbase rest configuration; you can connect over tcp instead by setting this to {"zookeeperHosts": ["localhost:2181"],"zookeeperRoot": "/hbase"}, but that mode has an OOM bug and is not recommended */
    "crawled_hbase_table": "crawled", /* hbase table that stores crawled data */
    "crawled_hbase_bin_table": "crawled_bin", /* hbase table that stores crawled binary data */
    "statistic_mysql_db": ["127.0.0.1", 3306, "crawling", "crawler", "123"], /* stores the results of crawl-log analysis; requires flume; normally unused */
    "check_driller_rules_interval": 120, /* how often to check for changes to the URL rules so they can be hot-reloaded into running crawlers */
    "spider_concurrency": 5, /* number of concurrent page requests made by the crawler */
    "spider_request_delay": 0, /* interval between two concurrent requests, in seconds */
    "schedule_interval": 60, /* interval between two scheduler runs */
    "schedule_quantity_limitation": 200, /* maximum number of pending URLs the scheduler hands to the crawler */
    "download_retry": 3, /* number of retries on error */
    "log_level": "DEBUG", /* log level */
    "use_ssdb": false, /* whether to use ssdb */
    "to_much_fail_exit": false, /* whether to stop the crawler automatically when there are too many errors */
    "keep_link_relation": false /* whether the link store keeps the relations between links */
}
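Because the real setting.json must not contain comments, an annotated copy like the one above has to be cleaned before use. The following Python sketch shows one way to do that; the `strip_block_comments` helper is hypothetical and not part of NEOCrawler. It skips `/* ... */` comments only when they occur outside string literals, then lets `json.loads` validate the result.

```python
import json


def strip_block_comments(text):
    """Remove /* ... */ comments that appear outside JSON string literals."""
    out = []
    i = 0
    in_string = False
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Copy the escaped character so an escaped quote does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# A shortened, annotated fragment in the style of the listing above.
annotated = '''
{
    /* note */
    "use_proxy": false, /* whether to use the proxy service */
    "proxy_router": "127.0.0.1:2013",
    "crawled_hbase_conf": ["localhost", 8080] /* or {"zookeeperRoot": "/hbase"} */
}
'''
settings = json.loads(strip_block_comments(annotated))
print(settings["crawled_hbase_conf"])  # ['localhost', 8080]
```

Tracking string state matters here because one of the comments itself contains quoted JSON, and a value could in principle contain `/*`.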
The specific startup commands are as follows:
node run.js -i abc -a config -p 8888
Open http://localhost:8888 in a browser to configure crawl rules in the web interface.
node run.js -i abc -a test -l "http://domain/page/"
node run.js -i abc -a schedule
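-i specifies the instance name and -a specifies the action (schedule here); the same applies below.
Here -p specifies the port of the proxy router. When running on the local machine, proxy_router in setting.json, including the port, is 127.0.0.1:2013.
The output log debug-result.json can be found under instance/example/logs.
In a production environment, we recommend managing the processes with pm2 (nodejs) or supervisor (python).
Open the web interface, for example http://localhost:8888/, go to "Drilling Rules", and add a rule. This is a json editor that can switch between code mode and visual mode. The configuration items are described below; for a concrete application configuration, see the example in the next chapter.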
{
    /* Note: the comments here only explain each item; real configuration code must not contain comments */
    "domain": "", /* top-level domain, e.g. 163.com (without the host name; www.163.com is wrong) */
    "url_pattern": "", /* URL rule, a regular expression, e.g. ^http://domain/\d+\.html; the more precisely it is scoped, the better */
    "alias": "", /* an alias for this rule */
    "id_parameter": [], /* the valid parameters this URL may carry; if the first value of the array is #, all parameters are filtered out */
    "encoding": "auto", /* page encoding; auto means auto-detect, or give a specific value: gbk, utf-8 */
    "type": "node", /* page type: branch or node */
    "save_page": true, /* whether to save the html source */
    "format": "html", /* page format: html/json/binary */
    "jshandle": false, /* whether js must be processed; determines whether the crawler loads the page with phantomjs */
    "extract_rule": { /* extraction rule, described separately below */
        "category": "crawled",
        "rule": { /* if no data is extracted, rule should be empty */
            "title": { /* an extraction unit, described separately below */
                "base": "content",
                "mode": "css",
                "expression": "title",
                "pick": "text",
                "index": 1
            }
        }
    },
    "cookie": [], /* cookie values, made up of multiple objects, each object being one cookie */
    "inject_jquery": false, /* whether to inject jquery when phantomjs is used */
    "load_img": false, /* whether to load images when phantomjs is used */
    "drill_rules": ["a"], /* links of interest on the page: css selectors that select a elements; there may be several; here it means all links */
    "drill_relation": { /* an extraction unit: extracts one value from the page to fill in the context relation's description of this page */
        "base": "content",
        "mode": "css",
        "expression": "title",
        "pick": "text",
        "index": 1
    },
    "validation_keywords": [], /* keywords that verify whether the page download is valid; there may be several; empty means no validation */
    "script": [], /* scripts executed in the page; there may be several, corresponding in order to each level; written in the form js_result=.. */
    "navigate_rule": [], /* automatic navigation: css selectors; there may be several, corresponding in order to each level; phantomjs clicks the matching elements to navigate */
    "stoppage": -1, /* stop after navigating this many levels */
    "priority": 1, /* scheduling priority; the smaller the number, the higher the priority */
    "weight": 10, /* scheduling weight; the larger the number, the higher the priority */
    "schedule_interval": 86400, /* rescheduling period, in seconds */
    "active": true, /* whether this rule is active */
    "seed": [], /* seed URLs; rescheduling starts from these URLs */
    "schedule_rule": "FIFO" /* scheduling mode, FIFO or LIFO */
}
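To make the id_parameter item more concrete, here is a rough Python sketch of how a URL might be reduced to its valid parameters before it is stored or deduplicated. The `normalize_url` helper is hypothetical, and two behaviours are assumptions of this sketch rather than facts about NEOCrawler: an empty list leaves the URL unchanged, and the fragment is always dropped.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url, id_parameter):
    """Keep only the query parameters named in id_parameter.

    Assumptions of this sketch (not confirmed by the source): an empty list
    leaves the URL unchanged, and the URL fragment is always dropped.
    """
    if not id_parameter:
        return url
    parts = urlsplit(url)
    if id_parameter[0] == "#":
        kept = []  # "#" as the first value filters out every parameter
    else:
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k in id_parameter]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(normalize_url("http://example.com/p?id=7&utm=x", ["id"]))  # http://example.com/p?id=7
print(normalize_url("http://example.com/p?id=7&utm=x", ["#"]))   # http://example.com/p
```

Normalizing this way means two URLs that differ only in tracking parameters are treated as the same page.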
{
    "base": "content", /* what to extract from: content (the page DOM) or url */
    "mode": "css", /* extraction mode: css or regex for a css selector or a regular expression; value for a fixed value */
    "expression": "title", /* the expression matching mode: a css selector, a regular expression, or a fixed value */
    "pick": "text", /* in css mode, the attribute or value to take from the element: text and html give the text value or the tag markup, @href gives the href attribute, and other attributes likewise take an @ prefix */
    "index": 1 /* which element to select when there are several; -1 selects all of them and returns an array */
}
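As a rough illustration of how such an extraction unit could be evaluated, the Python sketch below handles the regex and value modes only; css mode needs an HTML parser and is left out. The `apply_unit` helper is hypothetical, and treating index as 1-based (so 1 selects the first match) is an assumption of this sketch.

```python
import re


def apply_unit(unit, content, url):
    """Evaluate one extraction unit in "regex" or "value" mode.

    CSS mode needs an HTML parser and is deliberately left out of this sketch.
    """
    if unit["mode"] == "value":
        return unit["expression"]  # a fixed value, returned as-is
    if unit["mode"] != "regex":
        raise NotImplementedError("only regex and value modes are sketched here")
    source = content if unit["base"] == "content" else url
    # Use the first capture group when the pattern has one, else the whole match.
    matches = [m.group(1) if m.groups() else m.group(0)
               for m in re.finditer(unit["expression"], source)]
    index = unit.get("index", 1)
    if index == -1:
        return matches  # -1 means "all matches", returned as a list
    # Assumption: index is 1-based, so 1 selects the first match.
    return matches[index - 1] if 0 < index <= len(matches) else None


html = "<title>Hello</title><a href='/a'>A</a><a href='/b'>B</a>"
page_url = "http://www.sovxin.com/weixin_123.html"

links = {"base": "content", "mode": "regex", "expression": r"href='([^']+)'", "index": -1}
page_id = {"base": "url", "mode": "regex", "expression": r"weixin_(\d+)", "index": 1}

print(apply_unit(links, html, page_url))    # ['/a', '/b']
print(apply_unit(page_id, html, page_url))  # 123
```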
<span class="cm">/*摘取规则由多个摘取单元构成,它们之间的基本结构如下*/</span>
<span class="s2">"extract_rule"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"category"</span><span class="p">:</span> <span class="s2">"crawled"</span><span class="p">,</span><span class="cm">/*该节点存储到hbase的表名称*/</span>
<span class="s2">"rule"</span><span class="p">:</span> <span class="p">{</span><span class="cm">/*具体规则*/</span>
<span class="s2">"title"</span><span class="p">:</span> <span class="p">{</span><span class="cm">/*一个摘取单元,规则参考上面的说明*/</span>
<span class="s2">"base"</span><span class="p">:</span> <span class="s2">"content"</span><span class="p">,</span>
<span class="s2">"mode"</span><span class="p">:</span> <span class="s2">"css"</span><span class="p">,</span>
<span class="s2">"expression"</span><span class="p">:</span> <span class="s2">"title"</span><span class="p">,</span>
<span class="s2">"pick"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span>
<span class="s2">"index"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s2">"subset"</span><span class="p">:{</span><span class="cm">/*子集*/</span>
        "category":"comment", /* belongs to comment (stored under comment) */
        "relate":"#title#", /* link to the parent record */
        "mapping":false, /* subset type; when mapping is true the subset is stored separately in another table */
        "rule":{
          "profile": {"base":"content","mode":"css","expression":".classname","pick":"@href","index":1}, /* extraction unit */
          "message": {"base":"content","mode":"css","expression":".classname","pick":"@alt","index":1}
        },
        "require":["profile"] /* required fields */
      }
    }
  },
  "require":["title"] /* required fields; if an entry is itself an array, any one of the values in that array satisfies the requirement, e.g. [[a,b],c] */
}
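The `require` semantics described in the comment above can be sketched as follows. This is only an illustration, not the crawler's actual code: the function name `is_complete` is hypothetical, and treating an empty string as "missing" is an assumption.

```python
def is_complete(record, require):
    """Return True when every entry of `require` is satisfied by `record`.

    A plain string entry must be present; a nested list entry is satisfied
    when at least one of its fields is present (the [[a, b], c] case).
    """
    def present(field):
        # Assumption: None and "" both count as "not extracted".
        value = record.get(field)
        return value is not None and value != ""

    for entry in require:
        if isinstance(entry, list):
            if not any(present(field) for field in entry):
                return False
        elif not present(entry):
            return False
    return True
```

With `require = [["a", "b"], "c"]`, a record holding `a` and `c` passes, while a record holding only `a` does not, because `c` is mandatory.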
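3. A simple example
This step assumes the web configuration backend is already running. For how to start it, follow the instructions in the previous chapter.
Below is a sample configuration for crawling WeChat accounts. Suppose the goal is to crawl every WeChat account listed on http://www.sovxin.com.
The first step is to study the site's structure, which falls roughly into four levels: home page, category channel pages, list pages, and detail pages. The crawl rules are configured along this hierarchy. Only the detail page needs field-extraction rules; the other three page types exist only to discover detail pages step by step. We work in reverse order: start with the detail page and finish with the home page.
Read this example alongside the per-option explanations in the previous chapter.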
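To make the page hierarchy concrete, here is a minimal sketch of how a URL could be assigned to one of the page classes by its `url_pattern`. The three patterns and aliases are the ones used by this example's configurations; the `classify` helper and the first-match-wins order are assumptions made for illustration, not the crawler's own dispatch logic.

```python
import re

# (alias, url_pattern) pairs taken from the sovxin.com example configurations.
RULES = [
    ("detail", r"^http://www.sovxin.com/weixin_\d+.html$"),
    ("list", r"^http://www.sovxin.com/t_.*?.html$"),
    ("category", r"^http://www.sovxin.com/fenlei_.*?.html$"),
]


def classify(url):
    """Return the alias of the first rule whose pattern matches, else None."""
    for alias, pattern in RULES:
        if re.search(pattern, url):
            return alias
    return None
```

A URL that matches none of the patterns, such as the bare home page, falls through and returns `None` in this sketch.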
You can switch the editor to code mode and paste the JSON below into it.
{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/weixin_\\d+.html$",
  "alias": "detail",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "node",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {
      "nickname": {
        "base": "content",
        "mode": "css",
        "expression": "._title>strong",
        "pick": "text",
        "index": 1
      },
      "name": {
        "base": "content",
        "mode": "regex",
        "expression": ">微信号:(.*?)</td>",
        "pick": "text",
        "index": 1
      },
      "subtype": {
        "base": "content",
        "mode": "regex",
        "expression": ">账号类型:(.*?)</td>",
        "pick": "text",
        "index": 1
      },
      "location": {
        "base": "content",
        "mode": "css",
        "expression": ".js_other>._o_left>a",
        "pick": "text",
        "index": 1
      },
      "description": {
        "base": "content",
        "mode": "css",
        "expression": ".introduction",
        "pick": "html",
        "index": 1
      },
      "logo": {
        "base": "content",
        "mode": "css",
        "expression": ".avatar>img",
        "pick": "@src",
        "index": 1
      },
      "qrcode": {
        "base": "content",
        "mode": "css",
        "expression": ".erweima",
        "pick": "@src",
        "index": 1
      },
      "class": {
        "base": "content",
        "mode": "css",
        "expression": "._vb_weizhi>a:nth-child(2)",
        "pick": "text",
        "index": 1
      },
      "subclass": {
        "base": "content",
        "mode": "css",
        "expression": "._vb_weizhi>a:nth-child(3)",
        "pick": "text",
        "index": 1
      }
    },
    "require": [
      [
        "name",
        "oid",
        "nickname"
      ]
    ]
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a",
    ".avatar>img",
    ".erweima"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "._title>strong",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "当前位置"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 1,
  "weight": 10,
  "schedule_interval": 8640000,
  "active": true,
  "seed": [],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1408456940902
}
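As a rough illustration of what a `"mode": "regex"` field such as `name` or `subtype` does, the sketch below applies the two expressions from the detail configuration to a made-up HTML fragment. The sample markup, the `extract_regex` helper, and the reading of `"index": 1` as "first capture group" are all assumptions; the real extractor may behave differently.

```python
import re

# Hypothetical fragment imitating a detail page; not copied from the real site.
SAMPLE_HTML = "<tr><td>微信号:example_id</td><td>账号类型:订阅号</td></tr>"


def extract_regex(html, expression, index=1):
    """Return capture group `index` of the first match, or None if no match.

    Assumption: "index" in the rule selects the regex capture group.
    """
    match = re.search(expression, html)
    if match is None:
        return None
    return match.group(index).strip()


name = extract_regex(SAMPLE_HTML, ">微信号:(.*?)</td>")
subtype = extract_regex(SAMPLE_HTML, ">账号类型:(.*?)</td>")
```

The lazy `(.*?)` stops at the first closing `</td>`, so each field picks up only the text of its own table cell.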
{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/t_.*?.html$",
  "alias": "list",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "branch",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {}
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "title",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "当前位置"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 3,
  "weight": 10,
  "schedule_interval": 86400,
  "active": true,
  "seed": [
    "http://www.sovxin.com/t_xiuxianyule_#.html#1#300#1",
    "http://www.sovxin.com/t_jiankangshenghuo_#.html#1#300#1",
    "http://www.sovxin.com/t_wenhuajiaoyu_#.html#1#300#1",
    "http://www.sovxin.com/t_jiaoliu_#.html#1#300#1",
    "http://www.sovxin.com/t_qiyepinpai_#.html#1#300#1",
    "http://www.sovxin.com/t_mingxingmingren_#.html#1#300#1",
    "http://www.sovxin.com/t_youguanbumen_#.html#1#300#1",
    "http://www.sovxin.com/t_zonghe_#.html#1#300#1"
  ],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1414938594585
}
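The seeds in the list configuration carry a `#` placeholder followed by `#1#300#1`. A plausible reading, judging only from the shape of these strings, is "replace `#` with a page number running from 1 to 300 in steps of 1". The sketch below expands a seed under that assumption; the `expand_seed` helper and the inclusive upper bound are hypothetical, so check the configuration reference before relying on them.

```python
def expand_seed(seed):
    """Expand 'prefix#suffix#start#end#step' into concrete URLs.

    Assumption: the trailing three numbers are start, end (inclusive) and
    step for the '#' placeholder. Seeds without that shape are returned as-is.
    """
    parts = seed.split("#")
    if len(parts) != 5:
        return [seed]
    prefix, suffix, start, end, step = parts
    return [
        f"{prefix}{number}{suffix}"
        for number in range(int(start), int(end) + 1, int(step))
    ]


urls = expand_seed("http://www.sovxin.com/t_xiuxianyule_#.html#1#300#1")
```

Under this reading each of the eight seeds stands for 300 list pages, which would explain why the list rule needs no extraction fields of its own: its job is only to expose links to detail pages.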
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
{
    "domain": "sovxin.com",
    "url_pattern": "^http://www.sovxin.com/fenlei_.*?.html$",
    "alias": "category",
    "id_parameter": [
        "#"
    ],
    "encoding": "auto",
    "type": "branch",
    "save_page": false,
    "format": "html",
    "jshandle": false,
    "extract_rule": {
        "category": "crawled",
        "rule": {}
    },
    "cookie": [],
    "inject_jquery": false,
    "load_img": false,
    "drill_rules": [
        "a"
    ],
    "drill_relation": {
        "base": "content",
        "mode": "css",
        "expression": "title",
        "pick": "text",
        "index": 1
    },
    "validation_keywords": [
        "当前位置"
    ],
    "script": [],
    "navigate_rule": [],
    "stoppage": -1,
    "priority": 2,
    "weight": 10,
    "schedule_interval": 86400,
    "active": true,
    "seed": [
        "http://www.sovxin.com/fenlei_zixun.html"
    ],
    "schedule_rule": "FIFO",
    "use_proxy": false,
    "first_schedule": 1414938594585
}
{
    "domain": "sovxin.com",
    "url_pattern": "^http://www.sovxin.com/$",
    "alias": "home",
    "id_parameter": [
        "#"
    ],
    "encoding": "auto",
    "type": "branch",
    "save_page": false,
    "format": "html",
    "jshandle": false,
    "extract_rule": {
        "category": "crawled",
        "rule": {}
    },
    "cookie": [],
    "inject_jquery": false,
    "load_img": false,
    "drill_rules": [
        "a"
    ],
    "drill_relation": {
        "base": "content",
        "mode": "css",
        "expression": "title",
        "pick": "text",
        "index": 1
    },
    "validation_keywords": [
        "搜微信"
    ],
    "script": [],
    "navigate_rule": [],
    "stoppage": -1,
    "priority": 4,
    "weight": 10,
    "schedule_interval": 86400,
    "active": true,
    "seed": [
        "http://www.sovxin.com/"
    ],
    "schedule_rule": "FIFO",
    "use_proxy": false,
    "first_schedule": 1414938594585
}
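Each configuration is selected by its url_pattern, so it can help to check a pattern against a few URLs before saving it. The sketch below assumes the patterns behave as ordinary JavaScript regular expressions; rules and matchAlias are hypothetical names used only for illustration and are not part of NEOCrawler.

```javascript
// Hypothetical rule list: only alias and url_pattern are copied from the configs above.
var rules = [
  { alias: 'category', url_pattern: '^http://www.sovxin.com/fenlei_.*?.html$' },
  { alias: 'home', url_pattern: '^http://www.sovxin.com/$' }
];

// Return the alias of the first rule whose pattern matches the URL, or null.
function matchAlias(url) {
  for (var i = 0; i < rules.length; i++) {
    if (new RegExp(rules[i].url_pattern).test(url)) {
      return rules[i].alias;
    }
  }
  return null;
}

console.log(matchAlias('http://www.sovxin.com/fenlei_zixun.html'));
console.log(matchAlias('http://www.sovxin.com/'));
console.log(matchAlias('http://www.sovxin.com/about'));
```

Note that an unescaped "." in a pattern matches any character; writing "\\." in the JSON string is stricter if that distinction matters for your site.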
4. Advanced Example
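By default, crawled data is stored in HBase. You can turn this default off and store the data in another kind of database. Edit instance/<your instance>/settings.json and set save_content_to_hbase to false. Then edit instance/<your instance>/spider_extend.js, which is where custom development goes, and uncomment the pipeline method. After the crawler finishes a page it calls this function with two arguments: extracted_info, the structured data produced by extraction, and callback, a function you must call once your own work (in practice, writing the data to your database) is done. To see the structure of extracted_info, call console.dir(extracted_info) or set a breakpoint in the WebStorm IDE. The code below, which stores the data in MongoDB, is for reference only.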
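As a starting point, the contract described above can be reduced to a minimal sketch. The empty spider_extend constructor is a stand-in so the snippet runs by itself (an assumption, not the project's real file), and the sample object passed at the end is made up; only the url and extracted_data keys are taken from the example in this section.

```javascript
// Stand-in constructor (assumption): the real one lives in spider_extend.js.
function spider_extend() {}

// Minimal pipeline: inspect the extracted data, then hand control back.
spider_extend.prototype.pipeline = function (extracted_info, callback) {
  // Print the whole structure so you can see which fields are available.
  console.dir(extracted_info, { depth: null });
  // Store the data in your own database here, then always call callback().
  callback();
};

// Made-up sample input, only to exercise the function.
var done = false;
new spider_extend().pipeline(
  { url: 'http://www.sovxin.com/', extracted_data: { title: 'demo' } },
  function () { done = true; }
);
```

Whatever the storage backend, every code path should end by calling callback() exactly once, as the fuller MongoDB example that follows does.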
/**
 * Replaces the main framework's content pipeline.
 * If it does nothing, comment it out.
 * @param extracted_info (same as in extract)
 * @param callback must be called when the work is done
 */
spider_extend.prototype.pipeline = function(extracted_info, callback){
    var spider_extend = this;
    if(!extracted_info['extracted_data'] || isEmpty(extracted_info['extracted_data'])){
        logger.warn('data of ' + extracted_info['url'] + ' is empty.');
        callback();
    }else{
        var data = extracted_info['extracted_data'];
        if(data['article'] && data['article'].trim() != ""){
            // document id: md5 of the page URL
            var _id = crypto.createHash('md5').update(extracted_info['url']).digest('hex');
            // content fingerprint: md5 of the article with everything except Chinese characters, letters and digits removed
            var pureContent = data['article'].replace(/[^\u4e00-\u9fa5a-z0-9]/ig, '');
            var simplefp = crypto.createHash('md5').update(pureContent).digest('hex');
            var currentTime = (new Date()).getTime();
            data['updated'] = currentTime;
            data['published'] = false;
<span class="c1">//drop additional info</span>
<span class="k">if</span><span class="p">(</span><span class="nx">data</span><span class="p">[</span><span class="s1">'$category'</span><span class="p">])</span><span class="k">delete</span> <span class="nx">data</span><span class="p">[</span><span class="s1">'$category'</span><span class="p">];</span>
<span class="k">if</span><span class="p">(</span><span class="nx">data</span><span class="p">[</span><span class="s1">'$require'</span><span class="p">])</span><span class="k">delete</span> <span class="nx">data</span><span class="p">[</span><span class="s1">'$require'</span><span class="p">];</span>
<span class="c1">//format relation to array</span>
<span class="k">if</span><span class="p">(</span><span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'drill_relation'</span><span class="p">]){</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'relation'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'drill_relation'</span><span class="p">].</span><span class="nx">split</span><span class="p">(</span><span class="s1">'->'</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">//get domain</span>
<span class="kd">var</span> <span class="nx">urlibarr</span> <span class="o">=</span> <span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'origin'</span><span class="p">][</span><span class="s1">'urllib'</span><span class="p">].</span><span class="nx">split</span><span class="p">(</span><span class="s1">':'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">domain</span> <span class="o">=</span> <span class="nx">urlibarr</span><span class="p">[</span><span class="nx">urlibarr</span><span class="p">.</span><span class="nx">length</span><span class="o">-</span><span class="mi">2</span><span class="p">];</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'domain'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">domain</span><span class="p">;</span>
<span class="nx">logger</span><span class="p">.</span><span class="nx">debug</span><span class="p">(</span><span class="s1">'get '</span><span class="o">+</span><span class="nx">data</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">+</span><span class="s1">' from '</span><span class="o">+</span><span class="nx">domain</span><span class="o">+</span><span class="s1">'('</span><span class="o">+</span><span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span><span class="o">+</span><span class="s1">')'</span><span class="p">);</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'url'</span><span class="p">];</span>
<span class="kd">var</span> <span class="nx">query</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"$or"</span><span class="p">:[</span>
<span class="p">{</span>
<span class="s1">'_id'</span><span class="p">:</span><span class="nx">_id</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="s1">'simplefp'</span><span class="p">:</span><span class="nx">simplefp</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">};</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">mongoTable</span><span class="p">.</span><span class="nx">findOne</span><span class="p">(</span><span class="nx">query</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">item</span><span class="p">)</span> <span class="p">{</span>
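<span class="k">if</span><span class="p">(</span><span class="nx">err</span><span class="p">){</span><span class="nx">logger</span><span class="p">.</span><span class="nx">warn</span><span class="p">(</span><span class="s1">'query failed for '</span><span class="o">+</span><span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span><span class="o">+</span><span class="s1">': '</span><span class="o">+</span><span class="nx">err</span><span class="p">);</span><span class="nx">callback</span><span class="p">();}</span>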
<span class="k">else</span><span class="p">{</span>
<span class="k">if</span><span class="p">(</span><span class="nx">item</span><span class="p">){</span>
<span class="c1">//if the new data of field less than the old, drop it</span>
<span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">nlist</span><span class="p">){</span>
<span class="k">for</span><span class="p">(</span><span class="kd">var</span> <span class="nx">c</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="nx">c</span><span class="o"><</span><span class="nx">nlist</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span><span class="nx">c</span><span class="o">++</span><span class="p">)</span>
<span class="k">if</span><span class="p">(</span><span class="nx">data</span><span class="p">[</span><span class="nx">nlist</span><span class="p">[</span><span class="nx">c</span><span class="p">]]</span><span class="o">&&</span><span class="nx">item</span><span class="p">[</span><span class="nx">nlist</span><span class="p">[</span><span class="nx">c</span><span class="p">]]</span><span class="o">&&</span><span class="nx">data</span><span class="p">[</span><span class="nx">nlist</span><span class="p">[</span><span class="nx">c</span><span class="p">]].</span><span class="nx">length</span><span class="o"><</span><span class="nx">item</span><span class="p">[</span><span class="nx">nlist</span><span class="p">[</span><span class="nx">c</span><span class="p">]].</span><span class="nx">length</span><span class="p">)</span><span class="k">delete</span> <span class="nx">data</span><span class="p">[</span><span class="nx">nlist</span><span class="p">[</span><span class="nx">c</span><span class="p">]];</span>
<span class="p">})([</span><span class="s1">'title'</span><span class="p">,</span><span class="s1">'article'</span><span class="p">,</span><span class="s1">'tags'</span><span class="p">,</span><span class="s1">'keywords'</span><span class="p">]);</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">mongoTable</span><span class="p">.</span><span class="nx">update</span><span class="p">({</span><span class="s1">'_id'</span><span class="p">:</span><span class="nx">item</span><span class="p">[</span><span class="s1">'_id'</span><span class="p">]},{</span><span class="na">$set</span><span class="p">:</span><span class="nx">data</span><span class="p">},</span> <span class="p">{</span><span class="na">w</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span><span class="nx">result</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">reportdb</span><span class="p">.</span><span class="nx">rpush</span><span class="p">(</span><span class="s1">'queue:crawled'</span><span class="p">,</span> <span class="nx">_id</span><span class="p">);</span>
<span class="nx">logger</span><span class="p">.</span><span class="nx">debug</span><span class="p">(</span><span class="s1">'update '</span> <span class="o">+</span> <span class="nx">data</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span> <span class="o">+</span> <span class="s1">' to mongodb, '</span> <span class="o">+</span> <span class="nx">data</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="o">+</span> <span class="s1">' --override-> '</span> <span class="o">+</span> <span class="nx">item</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]);</span>
<span class="p">}</span>
<span class="nx">callback</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">}</span><span class="k">else</span><span class="p">{</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'simplefp'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">simplefp</span><span class="p">;</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'_id'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">_id</span><span class="p">;</span>
<span class="nx">data</span><span class="p">[</span><span class="s1">'created'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">currentTime</span><span class="p">;</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">mongoTable</span><span class="p">.</span><span class="nx">insert</span><span class="p">(</span><span class="nx">data</span><span class="p">,{</span><span class="na">w</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span> <span class="kd">function</span><span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">result</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="nx">err</span><span class="p">){</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">reportdb</span><span class="p">.</span><span class="nx">rpush</span><span class="p">(</span><span class="s1">'queue:crawled'</span><span class="p">,</span> <span class="nx">_id</span><span class="p">);</span>
<span class="nx">logger</span><span class="p">.</span><span class="nx">debug</span><span class="p">(</span><span class="s1">'insert '</span><span class="o">+</span><span class="nx">data</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">+</span><span class="s1">' to mongodb'</span><span class="p">);</span>
<span class="p">}</span>
<span class="nx">callback</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">}</span><span class="k">else</span><span class="p">{</span>
<span class="nx">logger</span><span class="p">.</span><span class="nx">warn</span><span class="p">(</span><span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span><span class="o">+</span><span class="s1">' is lack of content, drop it'</span><span class="p">);</span>
<span class="nx">callback</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
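The de-duplication step in the pipeline above can be tried in isolation. The following is a minimal sketch that uses only Node's built-in crypto module; the helper names urlId and contentFingerprint are illustrative and are not part of NEOCrawler.

```javascript
const crypto = require('crypto');

// md5 of the URL, used as the document _id in the pipeline above.
function urlId(url) {
  return crypto.createHash('md5').update(url).digest('hex');
}

// Strip everything except CJK characters, letters, and digits, then hash,
// so that pages differing only in whitespace or punctuation collide.
function contentFingerprint(article) {
  const pure = article.replace(/[^\u4e00-\u9fa5a-z0-9]/ig, '');
  return crypto.createHash('md5').update(pure).digest('hex');
}

const a = contentFingerprint('Hello, world!  你好。');
const b = contentFingerprint('Hello world 你好');
console.log(a === b); // true
```

Because the query matches on either the URL hash or the content fingerprint, the same article republished under a different URL updates the existing record instead of creating a new one.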
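Modify spider_concurrency in instance/(your instance)/settings.json. Note: this sets the crawler's number of concurrent requests; the re-crawl interval for each type of page is set in the rule configuration interface.
Sometimes the rules configured through the web interface cannot satisfy special crawling needs. For example, after a page has been fetched you may need to issue an AJAX sub-request and merge the data, or you may want to extract links and content with your own method. In that case, uncomment the extract method in instance/(your instance)/spider_extend.js. After the crawler has extracted content with its built-in method, it calls this function with two arguments: extracted_info, the crawled information including the extracted data, and callback, the function you must call once your work is done. To inspect the structure of extracted_info, use console.dir(extracted_info) or set a breakpoint in the WebStorm IDE. You must finish by calling callback with the extraction information as its argument, and its structure must match that of the extracted_info passed in; in practice it is best to modify extracted_info directly and pass it back. The following code is for reference only: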
<span class="cm">/**
* DIY extract, it happens after spider framework extracted data.
* @param extracted_info
* {
"signal":CMD_SIGNAL_CRAWL_SUCCESS,
"content":'...',
"remote_proxy":'...',
"cost":122,
"inject_jquery":true,
"js_result":[],
"drill_link":{"urllib_alias":[]},
"drill_count":0,
"cookie":[],
"url":'',
"status":200,
"origin":{
"url":link,
"type":'branch/node',
"referer":'',
"url_pattern":'...',
"save_page":true,
"cookie":[],
"jshandle":true,
"inject_jquery":true,
"drill_rules":[],
"script":[],
"navigate_rule":[],
"stoppage":-1,
"start_time":1234
}
};
* @returns callback({*})
*/</span>
<span class="nx">spider_extend</span><span class="p">.</span><span class="nx">prototype</span><span class="p">.</span><span class="nx">extract</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">extracted_info</span><span class="p">,</span><span class="nx">callback</span><span class="p">){</span>
<span class="kd">var</span> <span class="nx">self</span> <span class="o">=</span> <span class="k">this</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">domain</span> <span class="o">=</span> <span class="nx">__getTopLevelDomain</span><span class="p">(</span><span class="nx">extracted_info</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]);</span>
<span class="kd">var</span> <span class="nx">result</span> <span class="o">=</span> <span class="nx">extracted_info</span><span class="p">;</span>
<span class="k">switch</span><span class="p">(</span><span class="nx">domain</span><span class="p">){</span>
<span class="k">case</span> <span class="s1">'sino-manager.com'</span><span class="p">:</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">result</span><span class="p">[</span><span class="s1">'origin'</span><span class="p">].</span><span class="nx">urllib</span> <span class="o">==</span> <span class="s1">'urllib:driller:sino-manager.com:sinolist'</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o"><</span> <span class="nx">result</span><span class="p">[</span><span class="s1">'drill_link'</span><span class="p">][</span><span class="s1">'urllib:driller:sino-manager.com:sinolist'</span><span class="p">].</span><span class="nx">length</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">result</span><span class="p">[</span><span class="s1">'drill_link'</span><span class="p">][</span><span class="s1">'urllib:driller:sino-manager.com:sinolist'</span><span class="p">][</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="nx">result</span><span class="p">[</span><span class="s1">'drill_link'</span><span class="p">][</span><span class="s1">'urllib:driller:sino-manager.com:sinolist'</span><span class="p">][</span><span class="nx">i</span><span class="p">].</span><span class="nx">replace</span><span class="p">(</span><span class="sr">/</span><span class="se">(</span><span class="sr">.</span><span class="se">{31})</span><span class="sr">/</span><span class="p">,</span><span class="s2">"$1s"</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">case</span> <span class="s1">'chinaventure.com.cn'</span><span class="p">:</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">result</span><span class="p">[</span><span class="s1">'origin'</span><span class="p">].</span><span class="nx">urllib</span> <span class="o">==</span> <span class="s1">'urllib:driller:chinaventure.com.cn:chinaventurelist'</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">content</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">result</span><span class="p">[</span><span class="s1">'content'</span><span class="p">].</span><span class="nx">substring</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nx">result</span><span class="p">[</span><span class="s1">'content'</span><span class="p">].</span><span class="nx">length</span><span class="o">-</span><span class="mi">1</span><span class="p">));</span>
<span class="kd">var</span> <span class="nx">news_url</span> <span class="o">=</span> <span class="s1">''</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">detail</span> <span class="o">=</span> <span class="p">[];</span>
<span class="kd">var</span> <span class="nx">list</span> <span class="o">=</span> <span class="p">[];</span>
<span class="kd">var</span> <span class="nx">pages</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o"><</span> <span class="nx">content</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">detail</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">content</span><span class="p">[</span><span class="nx">i</span><span class="p">].</span><span class="nx">news_url</span><span class="p">);</span>
<span class="p">}</span>
<span class="nx">result</span><span class="p">[</span><span class="s1">'drill_link'</span><span class="p">][</span><span class="s1">'urllib:driller:chinaventure.com.cn:chinaventuredetail'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">detail</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">expression</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">RegExp</span><span class="p">(</span><span class="s1">'^.*pages=([0-9]+).*$'</span><span class="p">,</span><span class="s2">"ig"</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">matched</span> <span class="o">=</span> <span class="nx">expression</span><span class="p">.</span><span class="nx">exec</span><span class="p">(</span><span class="nx">result</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">matched</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">pages</span> <span class="o">=</span> <span class="nb">parseInt</span><span class="p">(</span><span class="nx">matched</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span>
<span class="nx">result</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">result</span><span class="p">[</span><span class="s1">'url'</span><span class="p">].</span><span class="nx">replace</span><span class="p">(</span><span class="s1">'pages='</span><span class="o">+</span><span class="nx">matched</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="s1">'pages='</span><span class="o">+</span><span class="nx">pages</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="p">}</span>
<span class="nx">logger</span><span class="p">.</span><span class="nx">debug</span><span class="p">(</span><span class="nx">result</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]);</span>
<span class="nx">list</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span> <span class="nx">result</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]);</span>
<span class="nx">result</span><span class="p">[</span><span class="s1">'drill_link'</span><span class="p">][</span><span class="s1">'urllib:driller:chinaventure.com.cn:chinaventurelist'</span><span class="p">]</span> <span class="o">=</span> <span class="nx">list</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="nl">default</span><span class="p">:;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">callback</span><span class="p">(</span><span class="nx">result</span><span class="p">);</span>
<span class="p">}</span>
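For a smaller starting point than the reference code above, here is a minimal sketch of an extract hook that follows the same contract: modify extracted_info in place and pass it to callback. The helper nextPageUrl and the rule name urllib:driller:example.com:list are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: bump the "pages=N" query parameter by one.
function nextPageUrl(url) {
  const matched = /pages=([0-9]+)/i.exec(url);
  if (!matched) return url; // no pagination parameter: leave untouched
  const next = parseInt(matched[1], 10) + 1;
  return url.replace('pages=' + matched[1], 'pages=' + next);
}

// A minimal extract hook: mutate extracted_info and hand it back.
function extract(extracted_info, callback) {
  const listKey = 'urllib:driller:example.com:list'; // hypothetical rule name
  extracted_info.drill_link = extracted_info.drill_link || {};
  extracted_info.drill_link[listKey] = [nextPageUrl(extracted_info.url)];
  return callback(extracted_info);
}

extract({ url: 'http://example.com/news?pages=3', drill_link: {} }, function (info) {
  console.log(info.drill_link['urllib:driller:example.com:list']);
});
```

In a real instance the function would be assigned to spider_extend.prototype.extract, as in the reference code.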
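5. Redis/SSDB data structures
Understanding the data structures helps you get familiar with the whole system before doing secondary development. NEOCrawler uses 4 storage spaces: driller_info_redis_db, url_info_redis_db, url_report_redis_db, and proxy_info_redis_db, all of which can be configured in settings.json under the instance. The 4 spaces store different categories of data and their key names do not conflict, so all 4 can point to a single Redis/SSDB database. Each space grows at a different rate, however: if you use Redis, it is recommended to point each space to its own db and, where resources allow, to give each space its own Redis instance. The structure of each of the 4 spaces is described below: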
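driller_info_redis_db: stores the crawl rules and URLs.
Example: driller:163.com:newslist (curly braces denote variables, here and below). Hash type; stores a crawl rule. Rules configured in the web interface are stored here.
Example: urllib:driller:163.com:newslist. List type; stores the URL queue for one rule. When the crawler discovers a URL that matches a crawl rule, it puts the URL into the corresponding queue; the scheduler draws URLs from these queues for scheduling, and the crawler crawls according to the scheduled queue. The whole process repeats in a loop.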
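The two example keys share one naming scheme, which can be sketched as follows. The function and parameter names (drillerKey, urllibKey, domainOf, domain, alias) are assumptions made for illustration; only the key strings themselves come from the examples.

```javascript
// Build the rule key and its URL-queue key from a domain and a rule alias.
function drillerKey(domain, alias) {
  return 'driller:' + domain + ':' + alias;
}
function urllibKey(domain, alias) {
  return 'urllib:' + drillerKey(domain, alias);
}

// Recover the domain from a queue key: it is the second-to-last segment,
// which is how the pipeline code in spider_extend.js reads it.
function domainOf(urllib) {
  const parts = urllib.split(':');
  return parts[parts.length - 2];
}

console.log(drillerKey('163.com', 'newslist'));
console.log(urllibKey('163.com', 'newslist'));
console.log(domainOf('urllib:driller:163.com:newslist'));
```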
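The pending-crawl queue, list type. Many urllib queues (see the previous entry) exist at the same time. The scheduler works out how many URLs can currently be appended from the crawler's overall scheduling limit and the length of the queue:scheduled:all queue, then, according to the scheduling weights (priority, weight) you configured in the web configuration center, draws the corresponding number of URLs from each queue into queue:scheduled:all. The crawler takes URLs from queue:scheduled:all to crawl.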
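One way to picture this allocation is the sketch below. It is an assumption about how a weight-proportional split could work, not NEOCrawler's actual scheduler code, and it uses in-memory arrays in place of Redis lists; the names schedule, limit, scheduledLength, and queues are hypothetical.

```javascript
// Assumed inputs: limit is the overall scheduling cap, scheduledLength is the
// current length of queue:scheduled:all, and queues maps each urllib key to
// { weight, urls }. None of these names come from NEOCrawler itself.
function schedule(limit, scheduledLength, queues) {
  const room = Math.max(0, limit - scheduledLength);
  const keys = Object.keys(queues);
  const totalWeight = keys.reduce((sum, k) => sum + queues[k].weight, 0);
  const picked = [];
  if (room === 0 || totalWeight === 0) return picked;
  for (const k of keys) {
    // Each queue gets a share of the free room proportional to its weight.
    const quota = Math.floor(room * queues[k].weight / totalWeight);
    picked.push(...queues[k].urls.splice(0, quota));
  }
  return picked;
}

const queues = {
  'urllib:driller:163.com:newslist': { weight: 3, urls: ['a1', 'a2', 'a3', 'a4'] },
  'urllib:driller:163.com:detail': { weight: 1, urls: ['b1', 'b2'] }
};
console.log(schedule(10, 6, queues)); // [ 'a1', 'a2', 'a3', 'b1' ]
```

With a limit of 10 and 6 URLs already scheduled, 4 slots are free, and the 3:1 weights split them into 3 and 1.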
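Records the version information of the crawl rule configuration. The crawler and the scheduler pick up rule changes on the fly (hot reload), but it is not feasible to scan and reload all rules on every scheduling cycle (a cycle is roughly a few seconds), so version tracking is used instead: after the web configuration center changes a crawl rule it updates the version information, and the crawler repeatedly checks this key and reloads the crawl rules when it detects a version change.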
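The version check can be sketched as follows, with a Map standing in for Redis. The key name rules:version and the function names are hypothetical, since the real key name is not shown here.

```javascript
// In-memory stand-in for the Redis space; 'rules:version' is a made-up key.
const store = new Map([['rules:version', '1']]);

function createRuleCache(loadRules) {
  let seenVersion = null;
  let rules = null;
  return function getRules() {
    const current = store.get('rules:version');
    if (current !== seenVersion) { // cheap check on every cycle
      rules = loadRules();         // expensive full reload only on change
      seenVersion = current;
    }
    return rules;
  };
}

let reloads = 0;
const getRules = createRuleCache(() => ({ loadedAt: ++reloads }));
getRules();
getRules();                        // version unchanged: no reload
store.set('rules:version', '2');   // the web configuration center bumps the version
getRules();
console.log(reloads); // 2
```

Reading one small key per cycle is cheap, while the full rule scan only happens when that key changes.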
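url_info_redis_db: this space stores URL information; the longer crawling runs, the larger its data volume grows.
Example: 9108d6a10bd476158144186138fe0ba8. Hash type; records the details of one URL: the page on which it was discovered, its current status, the crawler system's operation trail for it (discovered, scheduled, crawled, stored/failed, and so on), and the time of the last operation. The scheduler uses these records to decide whether to schedule a URL for crawling when it is discovered again.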
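A rough sketch of how such a record could drive the re-scheduling decision is shown below. It assumes that the key is the md5 hex digest of the URL (the example key has that shape, but this is not confirmed here), and the record fields and the interval rule are hypothetical.

```javascript
const crypto = require('crypto');

// Assumption: the key is the md5 hex digest of the URL.
function urlKey(url) {
  return crypto.createHash('md5').update(url).digest('hex');
}

// Hypothetical record shape and rule: schedule a URL that has never been
// seen, or whose last operation is older than the re-crawl interval.
function shouldSchedule(record, now, intervalMs) {
  if (!record) return true;
  return now - record.lastOperated >= intervalMs;
}

const urlInfo = new Map(); // in-memory stand-in for url_info_redis_db
const key = urlKey('http://news.163.com/');
console.log(shouldSchedule(urlInfo.get(key), 1000, 500)); // true, never seen
urlInfo.set(key, { status: 'crawled', lastOperated: 1000 });
console.log(shouldSchedule(urlInfo.get(key), 1200, 500)); // false, too recent
console.log(shouldSchedule(urlInfo.get(key), 1600, 500)); // true, interval elapsed
```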
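url_report_redis_db: this space stores crawl reports.
Example: fail:urllib:driller:163.com:newslist. Zset type; records URLs that failed to be crawled.
Example: stuck:urllib:driller:163.com:newslist. Zset type; records URLs that failed to be stored (to HBase).
URLs that failed to be crawled or stored can be added back to the crawl queue with tools/queue-helper.js to be crawled again.
Note: the crawler already retries failed fetches to cope with network problems, and the number of retries can be configured in settings.json. The crawl and storage failures mentioned above are URLs that still fail after multiple attempts, which is usually caused by an incorrect crawl rule or an HBase fault.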
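The retry-then-report behavior can be pictured with the sketch below. It is synchronous for brevity and rests on assumptions: fetchPage is any function that throws on failure, maxRetries mirrors the retry count in settings.json, and a Map stands in for the fail:urllib:... zset, with a timestamp as the assumed score.

```javascript
// All names here are hypothetical; this is not NEOCrawler's retry code.
function crawlWithRetry(url, fetchPage, maxRetries, failed) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fetchPage(url);
    } catch (err) {
      // swallow the error and try again until attempts run out
    }
  }
  failed.set(url, Date.now()); // only recorded after every attempt failed
  return null;
}

const failed = new Map();
let calls = 0;
const alwaysFails = () => { calls++; throw new Error('timeout'); };
const page = crawlWithRetry('http://example.com/', alwaysFails, 2, failed);
console.log(page, calls, failed.has('http://example.com/')); // null 3 true
```

With 2 retries the fetch is attempted 3 times in total, and the URL lands in the failure set only after the last attempt.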
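Example: count:20150203. Hash type; incremental statistics of crawl activity. The customization functions in spider_extend.js under the instance folder contain incremental-statistics statements, which are commented out by default; once uncommented, they record the statistics. The results can then be viewed under Crawling Daily Report in the web configuration center.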
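As an illustration of the daily counter, the sketch below builds a key of the form count:YYYYMMDD and increments a field in it. A Map stands in for the Redis hash, the hincrby helper only imitates the semantics of the Redis HINCRBY command, and the field name crawl:163.com is hypothetical.

```javascript
// Build the daily key, e.g. count:20150203, from a Date (local time).
function countKey(date) {
  const y = date.getFullYear();
  const m = String(date.getMonth() + 1).padStart(2, '0');
  const d = String(date.getDate()).padStart(2, '0');
  return 'count:' + y + m + d;
}

// In-memory stand-in for a Redis hash with HINCRBY-like semantics.
const report = new Map();
function hincrby(key, field, n) {
  const hash = report.get(key) || {};
  hash[field] = (hash[field] || 0) + n;
  report.set(key, hash);
  return hash[field];
}

const key = countKey(new Date(2015, 1, 3)); // month is zero-based: February
hincrby(key, 'crawl:163.com', 1);           // field name is hypothetical
hincrby(key, 'crawl:163.com', 1);
console.log(key, report.get(key));
```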
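proxy_info_redis_db: this space stores data related to proxy IPs.
List type; the currently available proxy IPs.
If you crawl too frequently, the server returns 429, and it is time to switch to another proxy IP. Abuyun proxy is recommended; it provides high-anonymity proxy IPs and crawler proxies.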
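Switching proxies on a 429 response can be sketched as follows. The proxy addresses, the request function, and fetchWithRotation itself are assumptions for illustration; a real crawler would read the list from proxy_info_redis_db and issue actual HTTP requests.

```javascript
// Assumptions: proxies is an in-memory copy of the available-proxy list and
// request(url, proxy) is any function returning an object with a status code.
function fetchWithRotation(url, proxies, request) {
  for (let i = 0; i < proxies.length; i++) {
    const res = request(url, proxies[i]);
    if (res.status !== 429) return { proxy: proxies[i], status: res.status };
    // 429 Too Many Requests: move on to the next proxy
  }
  return null; // every proxy was rate limited
}

const proxies = ['10.0.0.1:8080', '10.0.0.2:8080'];
const fakeRequest = (url, proxy) => ({ status: proxy === proxies[0] ? 429 : 200 });
console.log(fetchWithRotation('http://example.com/', proxies, fakeRequest));
```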