Node.js crawler

Posted by 阿布云

 

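Source code: http://git.oschina.net/dreamidea/neocrawler

1. Overview

NEOCrawler (Chinese name: 牛咖) is a crawler system built with nodejs, redis and phantomjs. The code is fully open source and is suited to data collection in vertical domains and to secondary development of crawlers.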

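【Key features】

  • Built on nodejs. JavaScript is simple, efficient and easy to learn, which saves considerable time both when developing the crawler and when users do secondary development on it. nodejs runs on Google's V8 engine, so performance is respectable. Because the nodejs runtime is non-blocking and asynchronous, it handles IO-intensive, CPU-light systems such as crawlers very well. In the author's rough comparison with implementations in other languages, the development effort is smaller than in C/C++/Java, and performance is higher than Java's multi-threaded implementations and Python's asynchronous and coroutine-based implementations.
  • A scheduling center handles URL scheduling while the crawler processes run distributed: the central scheduler decides which URLs are crawled in each time slice and coordinates the crawlers, so the failure of a single crawler does not affect the system as a whole.
  • Pages are parsed into structured data at crawl time and the required fields are extracted on the spot. What gets stored is not only the page source but also the structured fields, so the data is usable immediately after crawling and precise content deduplication at storage time is easier to implement.
  • phantomjs is integrated. phantomjs is a web browser implementation that needs no graphical environment, which makes it possible to crawl pages whose content only appears after js runs. User actions can be performed on the page through js statements, for example filling in and submitting a form and then crawling the next page, or clicking a button and crawling the page it navigates to.
  • Retry and fault tolerance. HTTP requests can fail in many ways, and every case is retried; failures are logged in detail so they can be investigated manually. Returned page content is validated, so blank pages, incomplete pages and pages hijacked by a proxy server can be detected.
  • Cookies can be preset, which solves the problem of content that can only be crawled after logging in.
  • Concurrency is limited, to avoid having the IP blocked by the source site because of too many connections.
  • Proxy IP support is integrated. This feature targets sites with anti-crawling measures (limits on visits or traffic per IP, or heuristic crawler detection). You need to supply working proxy IPs; the crawler then chooses, on its own, proxy addresses that can still reach the source site, which makes it hard for the source site to block the crawl.
  • Productized design. The base part of the crawler system and the business-specific part are separated architecturally, and the business-specific part needs no coding: it can be done through configuration.
  • Crawl rules are set in a web interface and hot-reloaded to the distributed crawlers. After a rule is saved in the web interface, the new rule is automatically pushed to the crawler processes running on different machines; adjusting rules requires neither coding nor a restart.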

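Configurable items:

1). Within one crawler system, URLs are described with regular expressions; similar pages are grouped into one class and share the same rules (the items below are all settings for one class of URLs);

2). Start address, crawl method, storage location, page handling method, and so on;

3). Rules for the links to collect; CSS selectors restrict the crawler to links that appear in a particular part of the page;

4). Page extraction rules; CSS selectors and regular expressions locate where each field's content is extracted from;

5). Predefined js statements to inject and execute after the page opens;

6). Preset cookies for the page;

7). Rules for judging whether this class of page was returned correctly, usually a set of keywords that must be present in a correctly returned page, which the crawler checks for;

8). Rules for judging whether data extraction is complete: a few essential fields are chosen from the extracted fields as the completeness criterion;

9). The scheduling weight (priority) and period (how long before the page is crawled again) for this class of page.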

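  • To reduce redundant development, crawl requirements are divided into instances. Each instance can have its own base configuration (storage database, crawler run parameters, custom code), so the configuration hierarchy of a crawl application is: crawler system -> instance -> URL.
  • The structure of the crawler system is modeled on scrapy and consists of core, spider, downloader, extractor and pipeline. core is the junction point and event control center for the components. spider handles queue input and output. downloader downloads pages and, depending on the configured rules, either downloads the plain html source or, after downloading, renders the page and runs its js/css in a phantomjs browser environment. extractor extracts structured data from the document according to the extraction rules. pipeline persists the data or passes it on to downstream data processing systems. All of these components provide customization interfaces; if configuration alone cannot meet a requirement, a component can easily be extended with js code.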
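The retry, page validation and concurrency limit features listed above can be illustrated with a small sketch. This is not NEOCrawler's code: `isValidPage`, `fetchWithRetry` and `mapWithConcurrency` are hypothetical names, and `download` stands for any function that returns a promise of the page's HTML.

```javascript
// Hypothetical sketch; none of these names come from the NEOCrawler source.

// True when the page is non-empty and contains every validation keyword.
function isValidPage(html, validationKeywords) {
  if (typeof html !== "string" || html.trim() === "") return false;
  return validationKeywords.every((kw) => html.includes(kw));
}

// Calls download(url) up to retries + 1 times until a valid page comes back.
// Every failed attempt is recorded so it can be inspected later.
async function fetchWithRetry(download, url, { retries = 3, keywords = [] } = {}) {
  const failures = [];
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    try {
      const html = await download(url);
      if (isValidPage(html, keywords)) return { ok: true, html, attempts: attempt };
      failures.push(`attempt ${attempt}: validation failed`);
    } catch (err) {
      failures.push(`attempt ${attempt}: ${err.message}`);
    }
  }
  return { ok: false, failures, attempts: retries + 1 };
}

// Runs worker(item) over all items with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```

A crawler built this way would call `mapWithConcurrency(urls, 5, (u) => fetchWithRetry(download, u, { keywords }))`, which keeps at most five requests open while each URL is still retried independently.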

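【Architecture】

Tip: users new to the system are advised to skip the architecture introduction and go straight to Part 2, get the system running first and build an intuition for it, then come back to the architecture. If you need to do in-depth secondary development, please read this section carefully.

Overall architecture

The yellow parts in the figure are the subsystems of the crawler system.

  • SuperScheduler is the central scheduler. Spider crawlers put the URLs they collect into the URL library that corresponds to each URL class, and SuperScheduler draws an appropriate number of URLs from each library, according to the scheduling rules, and puts them in the pending-crawl queue.
  • Spider is the crawler program, which runs distributed. It takes tasks from the pending-crawl queue prepared by the scheduler and crawls them, puts newly discovered URLs into the URL library and stores the extracted content. The crawler program is split into one core and four middleware components (spider, download, extract, pipeline) so that any one of these pieces can be re-customized fairly easily within a crawler instance.
  • ProxyRouter intelligently routes crawler requests to available proxy IPs when proxy IPs are in use.
  • Webconfig is the web back end for configuring crawl rules.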
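To make the scheduling step concrete, here is a minimal sketch of one scheduling round. The article does not show SuperScheduler's actual algorithm, so this is only one plausible reading: each URL library receives a share of the round in proportion to a weight, and URLs are taken oldest first. `scheduleRound` and the `{ name, weight, urls }` shape are assumptions made for illustration.

```javascript
// Hypothetical sketch; the real SuperScheduler logic is not shown in the article.
// libraries: [{ name, weight, urls: [...] }], one entry per URL class.
// limit: maximum number of URLs to place in the pending-crawl queue this round.
function scheduleRound(libraries, limit) {
  const totalWeight = libraries.reduce((sum, lib) => sum + lib.weight, 0);
  const queue = [];
  if (totalWeight <= 0) return queue;
  for (const lib of libraries) {
    // Each URL class gets a share of the round proportional to its weight.
    const quota = Math.floor((limit * lib.weight) / totalWeight);
    // FIFO: take the oldest URLs first and remove them from the library.
    queue.push(...lib.urls.splice(0, quota));
  }
  return queue;
}

const libraries = [
  { name: "list-pages", weight: 3, urls: ["l1", "l2", "l3", "l4", "l5", "l6", "l7"] },
  { name: "detail-pages", weight: 1, urls: ["d1", "d2", "d3"] },
];
const pending = scheduleRound(libraries, 8);
console.log(pending.length); // 6 list URLs + 2 detail URLs
```

Because the quota is rounded down, a round can hand out slightly fewer URLs than the limit; a real scheduler would also have to account for priority and recrawl periods.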

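2. Running the crawler

【Preparing the runtime environment】

  • Install nodejs, clone the source from the git repository, open a command prompt in that folder and run "npm install" to install the required modules;
  • Install a redis server (both redis and ssdb are supported; to save memory you can use ssdb. The type is set in setting.json, as described below).
  • Set up an hbase environment. Crawled pages and extracted data are stored in hbase. Once hbase is installed, enable its HTTP REST service, which is used in the configuration later. If you want to store data in a different database you can skip hbase; later sections explain how to turn off the hbase feature and customize your own storage. Initialize the hbase column families in the hbase shell: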

    create 'crawled', {NAME => 'basic', VERSIONS => 3}, {NAME => "data", VERSIONS => 3}, {NAME => 'extra', VERSIONS => 3}
    create 'crawled_bin', {NAME => 'basic', VERSIONS => 3}, {NAME => "binary", VERSIONS => 3}

The hbase rest mode is recommended. Once hbase is running, start hbase rest by executing the following command in the bin subdirectory of the hbase directory:

    ./hbase-daemon.sh start rest

The default port is 8080; it is used in the configuration below.

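【Instance configuration】

  • Instances live in the instance directory. Copy example and rename the copy to your own instance name, for example abc; the instructions below use this instance name throughout.
  • Edit instance/abc/setting.json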

    {
        /* Note: the comments here explain each setting; the real setting.json must not contain comments */
        "driller_info_redis_db": ["127.0.0.1", 6379, 0], /* where URL rule configuration is stored; the last number is the redis database index */
        "url_info_redis_db": ["127.0.0.1", 6379, 1], /* where URL information is stored */
        "url_report_redis_db": ["127.0.0.1", 6380, 2], /* where crawl error information is stored */
        "proxy_info_redis_db": ["127.0.0.1", 6379, 3], /* where http proxy addresses are stored */
        "use_proxy": false, /* whether to use the proxy service */
        "proxy_router": "127.0.0.1:2013", /* address of the proxy routing center, when the proxy service is used */
        "download_timeout": 60, /* download timeout in seconds; not the same as the response timeout */
        "save_content_to_hbase": false, /* whether to store crawled data in hbase; so far only tested with 0.94 */
        "crawled_hbase_conf": ["localhost", 8080], /* hbase rest configuration; a tcp connection is also possible, configured as {"zookeeperHosts": ["localhost:2181"],"zookeeperRoot": "/hbase"}, but that mode has an OOM bug and is not recommended */
        "crawled_hbase_table": "crawled", /* hbase table that stores crawled data */
        "crawled_hbase_bin_table": "crawled_bin", /* hbase table that stores crawled binary data */
        "statistic_mysql_db": ["127.0.0.1", 3306, "crawling", "crawler", "123"], /* stores the results of crawl log analysis; requires flume; normally not used */
        "check_driller_rules_interval": 120, /* how often to check for URL rule changes so they can be hot-reloaded into running crawlers */
        "spider_concurrency": 5, /* number of concurrent page requests per crawler */
        "spider_request_delay": 0, /* delay between two concurrent requests, in seconds */
        "schedule_interval": 60, /* interval between two scheduling rounds */
        "schedule_quantity_limitation": 200, /* maximum number of pending URLs the scheduler hands to the crawlers */
        "download_retry": 3, /* number of retries on error */
        "log_level": "DEBUG", /* log level */
        "use_ssdb": false, /* whether to use ssdb */
        "to_much_fail_exit": false, /* whether the crawler stops automatically when there are too many errors */
        "keep_link_relation": false /* whether the link library stores the relations between links */
    }
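Since the real setting.json must not contain comments, an annotated copy like the one above has to be stripped before use. The sketch below shows one way to do that; `parseAnnotatedJson` and `missingKeys` are hypothetical helpers, not part of NEOCrawler, and the list of required keys is only an example.

```javascript
// Hypothetical helpers, not part of NEOCrawler.

// Removes /* ... */ comments that sit outside string literals, then parses the JSON.
function parseAnnotatedJson(text) {
  let out = "";
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      out += ch;
      if (ch === "\\") {
        out += text[++i] ?? ""; // copy the escaped character verbatim
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else if (ch === "/" && text[i + 1] === "*") {
      const end = text.indexOf("*/", i + 2);
      if (end === -1) throw new SyntaxError("unterminated comment");
      i = end + 1; // continue after the closing "*/"
    } else {
      out += ch;
    }
  }
  return JSON.parse(out);
}

// Returns the names of required keys that are absent from the settings object.
function missingKeys(settings, required) {
  return required.filter((key) => !(key in settings));
}

const annotated = `{
  "use_proxy": false, /* whether to use the proxy service */
  "proxy_router": "127.0.0.1:2013", /* proxy routing center */
  "spider_concurrency": 5
}`;
const settings = parseAnnotatedJson(annotated);
console.log(missingKeys(settings, ["use_proxy", "spider_concurrency", "download_retry"]));
```

Tracking whether the scanner is inside a string matters because a value may legitimately contain the characters `/*`; a plain regular expression replacement would corrupt such values.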

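【Running】

The basic steps for running the crawler are:

  • Configure the crawl rules in the web interface
  • Check that a single URL is crawled correctly
  • Run the scheduler (one scheduler is enough)
  • Start the proxy router if you crawl through proxy IPs
  • Start the crawlers (several can be started in a distributed setup)

The concrete start commands are as follows.

  • Run the web configuration (for the rules themselves, see the next section)

    node run.js -i abc -a config -p 8888

    Open http://localhost:8888 in a browser to configure crawl rules in the web interface.

  • Test crawling a single page

    node run.js -i abc -a test -l "http://domain/page/"

  • Run the scheduler

    node run.js -i abc -a schedule

    -i specifies the instance name and -a specifies the action (here schedule); the same applies below.

  • Run the proxy router (only needed when crawling through proxy IPs)

    node run.js -i abc -a proxy -p 2013

    Here -p specifies the port of the proxy router. If it runs on the local machine, proxy_router in setting.json is 127.0.0.1:2013.

  • Run the crawler

    node run.js -i abc -a crawl

The output log debug-result.json can be found under instance/example/logs.

In a production environment, it is recommended to manage the processes with pm2 (nodejs) or supervisor (python).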
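As one way to follow the pm2 suggestion, the commands above can be collected in a pm2 ecosystem file. The sketch below is an assumption about how such a file might look for the abc instance: the process names, the number of crawler processes and the `buildApps` helper are all made up for illustration, while `name`, `script`, `args` and `autorestart` are standard pm2 app options.

```javascript
// ecosystem.config.js: a hypothetical pm2 process file for the "abc" instance.

// Builds one scheduler entry plus `crawlerCount` crawler entries.
function buildApps(instance, crawlerCount) {
  const app = (name, args) => ({
    name,
    script: "run.js",  // the entry point used by the commands above
    args,              // passed to run.js as command-line arguments
    autorestart: true, // let pm2 restart the process if it exits
  });
  // Only one scheduler should run per instance.
  const apps = [app(`${instance}-schedule`, `-i ${instance} -a schedule`)];
  for (let n = 1; n <= crawlerCount; n++) {
    apps.push(app(`${instance}-crawl-${n}`, `-i ${instance} -a crawl`));
  }
  return apps;
}

module.exports = { apps: buildApps("abc", 2) };
```

With this file in the project root, `pm2 start ecosystem.config.js` would start the scheduler and two crawlers; the proxy router could be added as another entry in the same way if proxy IPs are used.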

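【Crawl rule configuration】

Open the web interface, for example http://localhost:8888/, go to "Drilling Rules" and add a rule. This is a json editor that can be switched between code mode and visual mode. The configuration items are explained below; for concrete application configurations, see the examples in the next chapter.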

    {
      /* Note: the comments here explain each setting; the real configuration must not contain comments */
      "domain": "", /* top-level domain, e.g. 163.com (without the host name; www.163.com is wrong) */
      "url_pattern": "", /* URL rule, a regular expression, e.g. ^http://domain/\d+\.html; the more precisely it limits the scope, the better */
      "alias": "", /* an alias for this rule */
      "id_parameter": [], /* the valid parameters this URL may carry; if the first value of the array is #, all parameters are filtered out */
      "encoding": "auto", /* page encoding; auto means auto-detect, or give a specific value: gbk, utf-8 */
      "type": "node", /* page type: branch or node */
      "save_page": true, /* whether to save the html source */
      "format": "html", /* page format: html/json/binary */
      "jshandle": false, /* whether js needs to be processed; decides whether the crawler loads the page with phantomjs */
      "extract_rule": { /* extraction rule, described separately below */
        "category": "crawled",
        "rule": { /* if no data is extracted, rule should be empty */
          "title": { /* an extraction unit, described separately below */
            "base": "content",
            "mode": "css",
            "expression": "title",
            "pick": "text",
            "index": 1
          }
        }
      },
      "cookie": [], /* cookie values, made up of several objects; each object is one cookie value */
      "inject_jquery": false, /* whether to inject jquery when phantomjs is used */
      "load_img": false, /* whether to load images when phantomjs is used */
      "drill_rules": ["a"], /* links of interest on the page; css selectors that select a elements; several are allowed; here it means all links */
      "drill_relation": { /* an extraction unit; extracts a value from the page to fill in the contextual description of this page */
        "base": "content",
        "mode": "css",
        "expression": "title",
        "pick": "text",
        "index": 1
      },
      "validation_keywords": [], /* keywords used to verify that the page download is valid; several are allowed; empty means no validation */
      "script": [], /* scripts executed in the page; several are allowed, applied in order to each level; written in the form js_result=.. */
      "navigate_rule": [], /* automatic navigation; css selectors; several are allowed, applied in order to each level; phantomjs clicks the matching element to navigate */
      "stoppage": -1, /* stop after navigating this many levels */
      "priority": 1, /* scheduling priority; smaller numbers go first */
      "weight": 10, /* scheduling weight; larger numbers get higher priority */
      "schedule_interval": 86400, /* rescheduling period, in seconds */
      "active": true, /* whether this rule is active */
      "seed": [], /* seed URLs; rescheduling starts from these URLs */
      "schedule_rule": "FIFO" /* scheduling mode: FIFO or LIFO */
    }
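The `url_pattern` and `id_parameter` items above can be illustrated with a short sketch. It reflects one reading of the comments, not the crawler's actual code: `findRule` and `normalizeUrl` are hypothetical names, and treating an empty `id_parameter` array as "leave the query string untouched" is an assumption.

```javascript
// Hypothetical helpers illustrating url_pattern and id_parameter.

// Returns the first active rule whose url_pattern matches the URL, or null.
function findRule(rules, rawUrl) {
  return rules.find((r) => r.active && new RegExp(r.url_pattern).test(rawUrl)) || null;
}

// Keeps only the query parameters named in idParameter.
// A leading "#" drops every parameter; an empty list is assumed to keep them all.
function normalizeUrl(rawUrl, idParameter) {
  const url = new URL(rawUrl);
  url.hash = "";
  if (idParameter.length === 0) return url.toString();
  const keep = idParameter[0] === "#" ? [] : [...idParameter].sort();
  const kept = new URLSearchParams();
  for (const name of keep) {
    for (const value of url.searchParams.getAll(name)) kept.append(name, value);
  }
  url.search = kept.toString();
  return url.toString();
}

const rules = [
  { alias: "detail", url_pattern: "^http://example\\.com/item\\?", id_parameter: ["id"], active: true },
];
const rule = findRule(rules, "http://example.com/item?id=5&utm_source=feed");
console.log(normalizeUrl("http://example.com/item?id=5&utm_source=feed", rule.id_parameter));
```

Normalizing URLs this way means two links that differ only in tracking parameters map to the same entry in the URL library, which is presumably why the setting exists.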

Extraction unit

    {
      "base": "content", /* what to extract from: the page DOM (content) or the url */
      "mode": "css", /* extraction mode; css or regex means a css selector or a regular expression, value means a fixed value */
      "expression": "title", /* the expression matching mode: a css selector, a regular expression, or a fixed value */
      "pick": "text", /* in css mode, which attribute or value of the element to extract; text and html mean the text value or the tag markup, @href means the href attribute, and other attributes likewise take an @ prefix */
      "index": 1 /* which element to pick when there are several; -1 selects all of them and returns an array */
    }
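To show how an extraction unit might be evaluated, here is a minimal sketch that covers the `regex` and `value` modes only; `css` mode would need an HTML parser and is left out. `applyUnit` and the `{ url, content }` page shape are hypothetical, and reading `index` as 1-based is an assumption drawn from the `"index": 1` examples.

```javascript
// Hypothetical sketch of evaluating an extraction unit (regex and value modes only).
function applyUnit(unit, page) {
  // "base" chooses the text to extract from: the page content or its url.
  const source = unit.base === "url" ? page.url : page.content;
  if (unit.mode === "value") return unit.expression; // fixed value
  if (unit.mode !== "regex") throw new Error(`unsupported mode: ${unit.mode}`);
  const matches = [...source.matchAll(new RegExp(unit.expression, "g"))]
    // Prefer the first capture group when the pattern defines one.
    .map((m) => (m[1] !== undefined ? m[1] : m[0]));
  if (unit.index === -1) return matches; // -1: return every match as an array
  const hit = matches[unit.index - 1];   // assumed to be 1-based
  return hit === undefined ? null : hit;
}

const page = {
  url: "http://example.com/item/42.html",
  content: "<b>10</b><b>20</b>",
};
console.log(applyUnit({ base: "url", mode: "regex", expression: "/item/(\\d+)\\.html", index: 1 }, page));
console.log(applyUnit({ base: "content", mode: "regex", expression: "<b>(.*?)</b>", index: -1 }, page));
```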

摘取规则

<span class="cm">/*摘取规则由多个摘取单元构成,它们之间的基本结构如下*/</span> <span class="s2">"extract_rule"</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"category"</span><span class="p">:</span> <span class="s2">"crawled"</span><span class="p">,</span><span class="cm">/*该节点存储到hbase的表名称*/</span> <span class="s2">"rule"</span><span class="p">:</span> <span class="p">{</span><span class="cm">/*具体规则*/</span> <span class="s2">"title"</span><span class="p">:</span> <span class="p">{</span><span class="cm">/*一个摘取单元,规则参考上面的说明*/</span> <span class="s2">"base"</span><span class="p">:</span> <span class="s2">"content"</span><span class="p">,</span> <span class="s2">"mode"</span><span class="p">:</span> <span class="s2">"css"</span><span class="p">,</span> <span class="s2">"expression"</span><span class="p">:</span> <span class="s2">"title"</span><span class="p">,</span> <span class="s2">"pick"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span> <span class="s2">"index"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">"subset"</span><span class="p">:{</span><span class="cm">/*子集*/</span> <span class="s2">"category"</span><span class="p">:</span><span class="s2">"comment"</span><span class="p">,</span><span class="cm">/*属于comment(存储到comment)*/</span> <span class="s2">"relate"</span><span class="p">:</span><span class="s2">"#title#"</span><span class="p">,</span><span class="cm">/*与上级关联*/</span> <span class="s2">"mapping"</span><span class="p">:</span><span class="kc">false</span><span class="p">,</span><span class="cm">/*子集类型,mapping为true将分到另外的表中单独存储*/</span> <span class="s2">"rule"</span><span class="p">:{</span> <span class="s2">"profile"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"base"</span><span class="p">:</span><span class="s2">"content"</span><span class="p">,</span><span class="s2">"mode"</span><span class="p">:</span><span class="s2">"css"</span><span 
class="p">,</span><span class="s2">"expression"</span><span class="p">:</span><span class="s2">".classname"</span><span class="p">,</span><span class="s2">"pick"</span><span class="p">:</span><span class="s2">"@href"</span><span class="p">,</span><span class="s2">"index"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="cm">/*摘取单元*/</span> <span class="s2">"message"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"base"</span><span class="p">:</span><span class="s2">"content"</span><span class="p">,</span><span class="s2">"mode"</span><span class="p">:</span><span class="s2">"css"</span><span class="p">,</span><span class="s2">"expression"</span><span class="p">:</span><span class="s2">".classname"</span><span class="p">,</span><span class="s2">"pick"</span><span class="p">:</span><span class="s2">"@alt"</span><span class="p">,</span><span class="s2">"index"</span><span class="p">:</span><span class="mi">1</span><span class="p">}</span> <span class="p">},</span> <span class="s2">"require"</span><span class="p">:[</span><span class="s2">"profile"</span><span class="p">]</span><span class="cm">/*必须字段*/</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="s2">"require"</span><span class="p">:[</span><span class="s2">"title"</span><span class="p">]</span><span class="cm">/*必须字段,如果里面的值为数组,表示这个数组内的值有任意一个就满足要求,例如[[a,b],c]*/</span> <span class="p">}</span>

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

/* An extraction rule is made up of several extraction units; their basic structure is as follows */
"extract_rule": {
    "category": "crawled", /* name of the HBase table this node is stored in */
    "rule": { /* the actual rules */
      "title": { /* one extraction unit; see the explanation above for each option */
        "base": "content",
        "mode": "css",
        "expression": "title",
        "pick": "text",
        "index": 1,
        "subset": { /* subset */
          "category": "comment", /* belongs to comment (stored in comment) */
          "relate": "#title#", /* link to the parent unit */
          "mapping": false, /* subset type; when mapping is true the subset is stored separately in another table */
          "rule": {
            "profile": {"base": "content", "mode": "css", "expression": ".classname", "pick": "@href", "index": 1}, /* extraction unit */
            "message": {"base": "content", "mode": "css", "expression": ".classname", "pick": "@alt", "index": 1}
          },
          "require": ["profile"] /* required fields */
        }
      }
    },
    "require": ["title"] /* required fields; if an entry is itself an array, any one value in that array satisfies it, e.g. [[a,b],c] */
}
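The `require` semantics described in the last comment can be sketched as follows. This is only an illustration under stated assumptions: `isComplete` and `hasValue` are hypothetical helpers, not functions from NEOCrawler, and treating an empty string as a missing field is an assumption of this sketch.

```javascript
// Sketch only: hypothetical helpers, not NEOCrawler's own code.
// Assumption: undefined, null and "" all count as "field missing".
function hasValue(extracted, field) {
  const value = extracted[field];
  return value !== undefined && value !== null && value !== "";
}

// `require` lists conditions that must all hold. A condition that is itself
// an array is satisfied when any one of its fields is present.
function isComplete(extracted, require) {
  return require.every((condition) =>
    Array.isArray(condition)
      ? condition.some((field) => hasValue(extracted, field))
      : hasValue(extracted, condition)
  );
}

// [["a", "b"], "c"] reads as: (a or b) and c.
const sampleRequire = [["a", "b"], "c"];
console.log(isComplete({ b: "x", c: "y" }, sampleRequire)); // true
console.log(isComplete({ a: "x" }, sampleRequire));         // false, c is missing
```

Read this way, a nested array is a convenient way to accept a record when at least one of several alternative fields could be extracted.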

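3. A simple example

This walkthrough assumes the web configuration console is already running. See the previous chapter for how to start it.

Below is a sample configuration for crawling WeChat accounts. Suppose the goal is to collect every WeChat account listed on http://www.sovxin.com.

The first step is to study the structure of the site. It has roughly four levels: the home page, category channel pages, list pages, and detail pages. The crawl rules follow this hierarchy. Only the detail page needs field extraction rules; the other three page types exist only to discover detail pages step by step. We configure them in reverse order, starting with the detail page and finishing with the home page.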
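To make the idea of "one rule per page type" concrete, here is a minimal sketch of how a URL could be matched against `url_pattern` values. The two patterns are copied from the detail-page and list-page configurations in this chapter; the `classify` helper and the sample URLs are made up for illustration and are not part of NEOCrawler.

```javascript
// Sketch only: `classify` is a hypothetical helper, and the sample URLs
// at the bottom are invented. The patterns come from this chapter's configs.
const pageRules = [
  { alias: "detail", pattern: "^http://www.sovxin.com/weixin_\\d+.html$" },
  { alias: "list", pattern: "^http://www.sovxin.com/t_.*?.html$" },
];

// Return the alias of the first rule whose pattern matches, or null.
function classify(url) {
  const hit = pageRules.find((rule) => new RegExp(rule.pattern).test(url));
  return hit ? hit.alias : null;
}

console.log(classify("http://www.sovxin.com/weixin_12345.html")); // detail
console.log(classify("http://www.sovxin.com/t_zonghe_2.html"));   // list
console.log(classify("http://www.sovxin.com/about.html"));        // null
```

A URL that matches no pattern belongs to no configured page type, which is why each of the four levels needs its own rule.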

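Screenshot of the rule list (your console will not show these rules yet; click Add to create a rule, using the configurations listed below as a reference)

Detail page (where the actual content is extracted)

Read this example alongside the explanation of each configuration option in the previous chapter.

You can switch the editor to code mode and paste the following JSON into it.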

{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/weixin_\\d+.html$",
  "alias": "detail",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "node",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {
      "nickname": {
        "base": "content",
        "mode": "css",
        "expression": "._title>strong",
        "pick": "text",
        "index": 1
      },
      "name": {
        "base": "content",
        "mode": "regex",
        "expression": ">微信号:(.*?)</td>",
        "pick": "text",
        "index": 1
      },
      "subtype": {
        "base": "content",
        "mode": "regex",
        "expression": ">账号类型:(.*?)</td>",
        "pick": "text",
        "index": 1
      },
      "location": {
        "base": "content",
        "mode": "css",
        "expression": ".js_other>._o_left>a",
        "pick": "text",
        "index": 1
      },
      "description": {
        "base": "content",
        "mode": "css",
        "expression": ".introduction",
        "pick": "html",
        "index": 1
      },
      "logo": {
        "base": "content",
        "mode": "css",
        "expression": ".avatar>img",
        "pick": "@src",
        "index": 1
      },
      "qrcode": {
        "base": "content",
        "mode": "css",
        "expression": ".erweima",
        "pick": "@src",
        "index": 1
      },
      "class": {
        "base": "content",
        "mode": "css",
        "expression": "._vb_weizhi>a:nth-child(2)",
        "pick": "text",
        "index": 1
      },
      "subclass": {
        "base": "content",
        "mode": "css",
        "expression": "._vb_weizhi>a:nth-child(3)",
        "pick": "text",
        "index": 1
      }
    },
    "require": [
      [
        "name",
        "oid",
        "nickname"
      ]
    ]
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a",
    ".avatar>img",
    ".erweima"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "._title>strong",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "当前位置"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 1,
  "weight": 10,
  "schedule_interval": 8640000,
  "active": true,
  "seed": [],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1408456940902
}
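The two `regex`-mode units above (`name` and `subtype`) each rely on a single capture group. The sketch below shows what those expressions capture when run as ordinary JavaScript regular expressions. It is only an illustration: the HTML fragment is invented, `firstCapture` is a hypothetical helper, and NEOCrawler's extractor may apply `pick` and `index` differently.

```javascript
// Sketch only: the HTML fragment is invented for illustration, and this is
// plain JavaScript regex matching, not NEOCrawler's extractor.
const html = "<tr><td>微信号:example_id</td><td>账号类型:订阅号</td></tr>";

// Expressions copied from the "name" and "subtype" units above.
const expressions = {
  name: ">微信号:(.*?)</td>",
  subtype: ">账号类型:(.*?)</td>",
};

// Return the text of the first capture group, or null when nothing matches.
function firstCapture(source, expression) {
  const match = new RegExp(expression).exec(source);
  return match ? match[1] : null;
}

for (const [field, expression] of Object.entries(expressions)) {
  console.log(field, firstCapture(html, expression));
}
```

Because `(.*?)` is non-greedy, each capture stops at the nearest `</td>`, so the value stays confined to a single table cell.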

List page (used to pick up links to the detail pages configured above, as well as its own pagination links)

{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/t_.*?.html$",
  "alias": "list",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "branch",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {}
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "title",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "当前位置"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 3,
  "weight": 10,
  "schedule_interval": 86400,
  "active": true,
  "seed": [
    "http://www.sovxin.com/t_xiuxianyule_#.html#1#300#1",
    "http://www.sovxin.com/t_jiankangshenghuo_#.html#1#300#1",
    "http://www.sovxin.com/t_wenhuajiaoyu_#.html#1#300#1",
    "http://www.sovxin.com/t_jiaoliu_#.html#1#300#1",
    "http://www.sovxin.com/t_qiyepinpai_#.html#1#300#1",
    "http://www.sovxin.com/t_mingxingmingren_#.html#1#300#1",
    "http://www.sovxin.com/t_youguanbumen_#.html#1#300#1",
    "http://www.sovxin.com/t_zonghe_#.html#1#300#1"
  ],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1414938594585
}
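The seed entries above end in `#1#300#1`. One plausible reading, which is an assumption here and not something this section confirms, is that the `#` inside the URL is a page-number placeholder and the trailing `#start#end#step` gives its range. Under that assumption, the expansion could be sketched as follows; `expandSeed` is a hypothetical helper, not part of NEOCrawler.

```javascript
// Hypothetical helper: expands "http://host/t_x_#.html#1#300#1" into concrete URLs.
// ASSUMPTION: the trailing "#start#end#step" describes the range of the "#" placeholder.
function expandSeed(seed) {
  const match = /^(.*)#(\d+)#(\d+)#(\d+)$/.exec(seed);
  if (!match) return [seed]; // plain seed without a range suffix
  const template = match[1];
  const start = Number(match[2]);
  const end = Number(match[3]);
  const step = Number(match[4]);
  if (step <= 0) return [seed]; // avoid an endless loop on a bad step
  const expanded = [];
  for (let n = start; n <= end; n += step) {
    // Replace only the first "#", which is the placeholder inside the path.
    expanded.push(template.replace('#', String(n)));
  }
  return expanded;
}

const urls = expandSeed('http://www.sovxin.com/t_zonghe_#.html#1#300#1');
console.log(urls.length, urls[0], urls[urls.length - 1]);
```

If this reading is right, each of the eight seeds stands for 300 list pages, all of which still match the rule's `url_pattern`.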

Category channel page (the list-page links configured above are extracted from it)

{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/fenlei_.*?.html$",
  "alias": "category",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "branch",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {}
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "title",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "当前位置"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 2,
  "weight": 10,
  "schedule_interval": 86400,
  "active": true,
  "seed": [
    "http://www.sovxin.com/fenlei_zixun.html"
  ],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1414938594585
}
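The list-page and category-page configurations both put `当前位置` under `validation_keywords`. A check of that kind can be sketched as below, assuming that a page counts as valid only when every listed keyword appears in the returned HTML. `isPageValid` is a hypothetical name, and the actual check inside NEOCrawler may differ.

```javascript
// Hypothetical sketch of a validation_keywords check.
// ASSUMPTION: the page is accepted only if all keywords occur in the HTML.
function isPageValid(html, validationKeywords) {
  if (typeof html !== 'string' || html.length === 0) return false; // blank page
  return validationKeywords.every(function (keyword) {
    return html.indexOf(keyword) !== -1;
  });
}

const rule = { validation_keywords: ['当前位置'] };
console.log(isPageValid('<div>当前位置: 首页</div>', rule.validation_keywords)); // true
console.log(isPageValid('<html>502 Bad Gateway</html>', rule.validation_keywords)); // false
```

Choosing a keyword that every normal page of that type contains (here a breadcrumb label) is what lets the crawler tell a real page from an error page or a page hijacked by a proxy.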

Home page (the category channel pages configured above are extracted from it)

{
  "domain": "sovxin.com",
  "url_pattern": "^http://www.sovxin.com/$",
  "alias": "home",
  "id_parameter": [
    "#"
  ],
  "encoding": "auto",
  "type": "branch",
  "save_page": false,
  "format": "html",
  "jshandle": false,
  "extract_rule": {
    "category": "crawled",
    "rule": {}
  },
  "cookie": [],
  "inject_jquery": false,
  "load_img": false,
  "drill_rules": [
    "a"
  ],
  "drill_relation": {
    "base": "content",
    "mode": "css",
    "expression": "title",
    "pick": "text",
    "index": 1
  },
  "validation_keywords": [
    "搜微信"
  ],
  "script": [],
  "navigate_rule": [],
  "stoppage": -1,
  "priority": 4,
  "weight": 10,
  "schedule_interval": 86400,
  "active": true,
  "seed": [
    "http://www.sovxin.com/"
  ],
  "schedule_rule": "FIFO",
  "use_proxy": false,
  "first_schedule": 1414938594585
}
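The home, category, and list rules are told apart by `url_pattern`. The sketch below uses only the patterns and aliases shown in those configurations to find which rule a URL falls under. `findRule` is a hypothetical helper, and how NEOCrawler itself resolves a URL that matches several patterns is not covered here.

```javascript
// Patterns and aliases copied from the three configurations in this section.
const rules = [
  { alias: 'home', url_pattern: '^http://www.sovxin.com/$' },
  { alias: 'category', url_pattern: '^http://www.sovxin.com/fenlei_.*?.html$' },
  { alias: 'list', url_pattern: '^http://www.sovxin.com/t_.*?.html$' }
];

// Hypothetical helper: returns the alias of the first rule whose pattern matches.
function findRule(url) {
  for (const rule of rules) {
    if (new RegExp(rule.url_pattern).test(url)) return rule.alias;
  }
  return null; // no rule covers this URL
}

console.log(findRule('http://www.sovxin.com/'));                  // home
console.log(findRule('http://www.sovxin.com/fenlei_zixun.html')); // category
console.log(findRule('http://www.sovxin.com/t_zonghe_1.html'));   // list
console.log(findRule('http://www.sovxin.com/about'));             // null
```

Note that the dots in these patterns are unescaped, so each one matches any character, not only a literal dot. That is harmless for this site but worth escaping (`\\.` inside the JSON string) when a stricter match matters.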

IV. Advanced Examples

Customizing Data Storage

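Crawled data is stored in HBase by default. You can turn this default off and store the data in another kind of database. Edit instance/<your-instance>/settings.json and set save_content_to_hbase to false. Then edit instance/<your-instance>/spider_extend.js, which is where custom development goes, and uncomment the pipeline method. The crawler calls this function after it finishes crawling a page. It passes in two arguments: extracted_info, the structured data produced by extraction, and callback, a function you must call once your own work (in practice, writing the data to your database) is done. To see the structure of extracted_info, call console.dir(extracted_info) or set a breakpoint in the WebStorm IDE. The following code (which stores to MongoDB) is for reference only.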

// crypto is Node's built-in module; require it here if spider_extend.js does not already.
// isEmpty and logger are assumed to be defined elsewhere in spider_extend.js.
var crypto = require('crypto');

/**
 * Replaces the main framework's content pipeline.
 * If it does nothing, comment it out.
 * @param extracted_info (same as the result of extract)
 * @param callback must be called exactly once when storage is finished
 */
spider_extend.prototype.pipeline = function(extracted_info, callback){
    var spider_extend = this;
    if(!extracted_info['extracted_data'] || isEmpty(extracted_info['extracted_data'])){
        logger.warn('data of ' + extracted_info['url'] + ' is empty.');
        callback();
    }else{
        var data = extracted_info['extracted_data'];
        if(data['article'] && data['article'].trim() != ""){
            // _id: md5 of the URL; simplefp: md5 of the article with everything
            // except Chinese characters, letters and digits stripped out
            var _id = crypto.createHash('md5').update(extracted_info['url']).digest('hex');
            var puerContent = data['article'].replace(/[^\u4e00-\u9fa5a-z0-9]/ig, '');
            var simplefp = crypto.createHash('md5').update(puerContent).digest('hex');
            var currentTime = (new Date()).getTime();
            data['updated'] = currentTime;
            data['published'] = false;

            //drop additional info
            if(data['$category']) delete data['$category'];
            if(data['$require']) delete data['$require'];

            //format relation to array
            if(extracted_info['drill_relation']){
                data['relation'] = extracted_info['drill_relation'].split('->');
            }

            //get domain
            var urlibarr = extracted_info['origin']['urllib'].split(':');
            var domain = urlibarr[urlibarr.length - 2];
            data['domain'] = domain;

            logger.debug('get ' + data['title'] + ' from ' + domain + '(' + extracted_info['url'] + ')');
            data['url'] = extracted_info['url'];

            // a record is a duplicate if either the URL hash or the content fingerprint matches
            var query = {
                "$or": [
                    { '_id': _id },
                    { 'simplefp': simplefp }
                ]
            };

            spider_extend.mongoTable.findOne(query, function(err, item){
                if(err){
                    // The original code threw here, which left callback() unreachable;
                    // log the failure and let the crawler continue instead.
                    logger.warn('mongodb query failed for ' + extracted_info['url'] + ': ' + err);
                    callback();
                }else{
                    if(item){
                        //if the new value of a field is shorter than the old one, drop it
                        (function(nlist){
                            for(var c = 0; c < nlist.length; c++)
                                if(data[nlist[c]] && item[nlist[c]] && data[nlist[c]].length < item[nlist[c]].length) delete data[nlist[c]];
                        })(['title', 'article', 'tags', 'keywords']);

                        spider_extend.mongoTable.update({'_id': item['_id']}, {$set: data}, {w: 1}, function(err, result){
                            if(!err){
                                spider_extend.reportdb.rpush('queue:crawled', _id);
                                logger.debug('update ' + data['title'] + ' to mongodb, ' + data['url'] + ' --override-> ' + item['url']);
                            }
                            callback();
                        });
                    }else{
                        data['simplefp'] = simplefp;
                        data['_id'] = _id;
                        data['created'] = currentTime;
                        spider_extend.mongoTable.insert(data, {w: 1}, function(err, result){
                            if(!err){
                                spider_extend.reportdb.rpush('queue:crawled', _id);
                                logger.debug('insert ' + data['title'] + ' to mongodb');
                            }
                            callback();
                        });
                    }
                }
            });
        }else{
            logger.warn(extracted_info['url'] + ' is lack of content, drop it');
            callback();
        }
    }
};


/**
 * Replaces the main framework's content pipeline.
 * If it does nothing, comment it out.
 * @param extracted_info (same as for extract)
 */
spider_extend.prototype.pipeline = function(extracted_info, callback){
    var spider_extend = this;
    if(!extracted_info['extracted_data'] || isEmpty(extracted_info['extracted_data'])){
        logger.warn('data of ' + extracted_info['url'] + ' is empty.');
        callback();
    }else{
        var data = extracted_info['extracted_data'];
        if(data['article'] && data['article'].trim() != ""){
            // _id is the md5 of the url
            var _id = crypto.createHash('md5').update(extracted_info['url']).digest('hex');
            // simplefp is the md5 of the article with everything except
            // Chinese characters, letters and digits removed
            var pureContent = data['article'].replace(/[^\u4e00-\u9fa5a-z0-9]/ig, '');
            var simplefp = crypto.createHash('md5').update(pureContent).digest('hex');

            var currentTime = (new Date()).getTime();
            data['updated'] = currentTime;
            data['published'] = false;

            //drop additional info
            if(data['$category']) delete data['$category'];
            if(data['$require']) delete data['$require'];

            //format relation to array
            if(extracted_info['drill_relation']){
                data['relation'] = extracted_info['drill_relation'].split('->');
            }

            //get domain
            var urlibarr = extracted_info['origin']['urllib'].split(':');
            var domain = urlibarr[urlibarr.length-2];
            data['domain'] = domain;

            logger.debug('get ' + data['title'] + ' from ' + domain + '(' + extracted_info['url'] + ')');
            data['url'] = extracted_info['url'];

            // match an existing record by url hash or by content fingerprint
            var query = {
                "$or":[
                    {'_id':_id},
                    {'simplefp':simplefp}
                ]
            };
            spider_extend.mongoTable.findOne(query, function(err, item) {
                if(err){ throw err; }
                else{
                    if(item){
                        //if a new field is shorter than the stored one, drop the new value
                        (function(nlist){
                            for(var c=0;c<nlist.length;c++){
                                if(data[nlist[c]] && item[nlist[c]] && data[nlist[c]].length < item[nlist[c]].length) delete data[nlist[c]];
                            }
                        })(['title','article','tags','keywords']);

                        spider_extend.mongoTable.update({'_id':item['_id']}, {$set:data}, {w:1}, function(err, result) {
                            if(!err) {
                                spider_extend.reportdb.rpush('queue:crawled', _id);
                                logger.debug('update ' + data['title'] + ' to mongodb, ' + data['url'] + ' --override-> ' + item['url']);
                            }
                            callback();
                        });
                    }else{
                        data['simplefp'] = simplefp;
                        data['_id'] = _id;
                        data['created'] = currentTime;
                        spider_extend.mongoTable.insert(data, {w:1}, function(err, result) {
                            if(!err){
                                spider_extend.reportdb.rpush('queue:crawled', _id);
                                logger.debug('insert ' + data['title'] + ' to mongodb');
                            }
                            callback();
                        });
                    }
                }
            });
        }else{
            logger.warn(extracted_info['url'] + ' is lack of content, drop it');
            callback();
        }
    }
}
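
The de-duplication in the pipeline above relies on the simplefp content fingerprint. The sketch below isolates that one step so it can be tried on its own. The helper name contentFingerprint is hypothetical and not part of NEOCrawler; only the regular expression and the md5 call are taken from the pipeline code.

```javascript
const crypto = require('crypto');

// Hypothetical helper (not part of NEOCrawler): reproduces the simplefp
// computation from the pipeline so it can be tested in isolation.
function contentFingerprint(text) {
  // Keep only CJK characters in U+4E00..U+9FA5, ASCII letters and digits.
  // Whitespace, punctuation and markup characters are dropped.
  const normalized = text.replace(/[^\u4e00-\u9fa5a-z0-9]/ig, '');
  return crypto.createHash('md5').update(normalized).digest('hex');
}

// Two copies of the same article that differ only in punctuation and spacing
// produce the same fingerprint, so the second one updates the stored record
// instead of being inserted again.
const first = contentFingerprint('Hello, 世界! 2024');
const second = contentFingerprint('  Hello世界 2024\n');
console.log(first === second); // true
```

Note that the i flag only makes the character class match both cases; the text is not lowercased, so copies that differ in letter case still get different fingerprints.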

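Adjusting the crawler's concurrency

Change spider_concurrency in instance/<your instance>/settings.json. Note: this setting controls the number of concurrent requests the crawler makes; the re-crawl interval for each type of page is set in the rule configuration interface.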
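
To make the effect of a concurrent-request cap concrete, here is a minimal sketch of a limiter that never runs more than a fixed number of tasks at once. It is an illustration only, not NEOCrawler's implementation; createLimiter, push and stats are hypothetical names, and the value 2 merely stands in for whatever spider_concurrency is set to.

```javascript
// Hypothetical illustration of a concurrency cap (not NEOCrawler code).
// Each task receives a `done` callback and must call it when it finishes.
function createLimiter(maxConcurrent) {
  let running = 0;
  const waiting = [];

  // Start queued tasks while there is spare capacity.
  function next() {
    while (running < maxConcurrent && waiting.length > 0) {
      const task = waiting.shift();
      running++;
      let finished = false;
      task(function done() {
        if (finished) return; // ignore duplicate completion calls
        finished = true;
        running--;
        next();
      });
    }
  }

  return {
    push(task) { waiting.push(task); next(); },
    stats() { return { running: running, waiting: waiting.length }; }
  };
}

// Queue five fake "requests" with a cap of 2. Each task just stores its
// completion callback so that we can finish it by hand.
const limiter = createLimiter(2);
const pending = [];
for (let i = 0; i < 5; i++) {
  limiter.push(function (done) { pending.push(done); });
}
console.log(limiter.stats()); // { running: 2, waiting: 3 }

pending[0](); // the first request completes, so a queued one starts
console.log(limiter.stats()); // { running: 2, waiting: 2 }
```

A lower cap reduces the number of simultaneous connections to the source site, at the cost of a longer total crawl time.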

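Customizing link extraction and content extraction

Sometimes the rules configured through the web interface cannot cover a special crawling requirement. For example, after a page has been fetched you may need to issue an AJAX sub-request and merge its data, or you may want to extract links and content with your own method. In that case, uncomment the extract method in instance/<your instance>/spider_extend.js. After the crawler has extracted content with its built-in method, it calls this function with two arguments: extracted_info, the crawl information including the extracted data, and callback, the function you must call once your own processing is done. To see the structure of extracted_info, use console.dir(extracted_info) or set a breakpoint in the WebStorm IDE. You must finish by calling callback with the extraction information as its argument, and its structure must match the extracted_info that was passed in. In practice it is best to modify extracted_info in place and pass it back. The following code is for reference only: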



/**
 * DIY extract, it happens after spider framework extracted data.
 * @param extracted_info
 * {
        "signal":CMD_SIGNAL_CRAWL_SUCCESS,
        "content":'...',
        "remote_proxy":'...',
        "cost":122,
        "inject_jquery":true,
        "js_result":[],
        "drill_link":{"urllib_alias":[]},
        "drill_count":0,
        "cookie":[],
        "url":'',
        "status":200,
        "origin":{
            "url":link,
            "type":'branch/node',
            "referer":'',
            "url_pattern":'...',
            "save_page":true,
            "cookie":[],
            "jshandle":true,
            "inject_jquery":true,
            "drill_rules":[],
            "script":[],
            "navigate_rule":[],
            "stoppage":-1,
            "start_time":1234
        }
    };
 * @returns callback({*})
 */
spider_extend.prototype.extract = function(extracted_info,callback){
    var self = this;
    var domain = __getTopLevelDomain(extracted_info['url']);
    var result = extracted_info;
    switch(domain){
        case 'sino-manager.com':
            if (result['origin'].urllib == 'urllib:driller:sino-manager.com:sinolist') {
                var sinoKey = 'urllib:driller:sino-manager.com:sinolist';
                // Insert an "s" after the first 31 characters of every collected link.
                for(var i = 0; i < result['drill_link'][sinoKey].length; i++) {
                    result['drill_link'][sinoKey][i] = result['drill_link'][sinoKey][i].replace(/(.{31})/,"$1s");
                }
            }
            break;
        case 'chinaventure.com.cn':
            if (result['origin'].urllib == 'urllib:driller:chinaventure.com.cn:chinaventurelist') {
                // Drop the first and last character of the body, then parse the rest as JSON.
                var content = JSON.parse(result['content'].substring(1,result['content'].length-1));
                var detail = [];
                var list = [];
                var pages;
                // Every entry's news_url becomes a link for the detail-page queue.
                for(var j = 0; j < content.length; j++) {
                    detail.push(content[j].news_url);
                }
                result['drill_link']['urllib:driller:chinaventure.com.cn:chinaventuredetail'] = detail;
                // Increment the "pages" query parameter so the next list page is queued.
                var expression = new RegExp('^.*pages=([0-9]+).*$',"ig");
                var matched = expression.exec(result['url']);
                if (matched) {
                    pages = parseInt(matched[1], 10)+1;
                    result['url'] = result['url'].replace('pages='+matched[1],'pages='+pages);
                }
                logger.debug(result['url']);
                list.push(result['url']);
                result['drill_link']['urllib:driller:chinaventure.com.cn:chinaventurelist'] = list;
            }
            break;
        default:;
    }
    return callback(result);
}
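The two URL rewrites used in the extract hook above can be tried in isolation. The following is only a sketch: the function names `insertAfter31` and `nextPageUrl` and the example.com address are made up for illustration and are not part of neocrawler.

```javascript
// Hypothetical helper: mirrors replace(/(.{31})/, "$1s") from the hook above.
// It inserts an "s" after the first 31 characters; shorter strings are unchanged.
function insertAfter31(link) {
  return link.replace(/(.{31})/, '$1s');
}

// Hypothetical helper: bumps the numeric "pages" parameter by one,
// or returns the URL untouched when no such parameter is present.
function nextPageUrl(url) {
  var matched = /pages=([0-9]+)/.exec(url);
  if (!matched) {
    return url;
  }
  var next = parseInt(matched[1], 10) + 1;
  return url.replace('pages=' + matched[1], 'pages=' + next);
}

console.log(nextPageUrl('http://example.com/list?pages=3&x=1'));
```

Keeping the rewrites as small pure functions like this makes them easy to check before wiring them into a site-specific `case` branch.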

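5. Redis/ssdb Data Structures

Understanding the data structures helps you get familiar with the whole system when doing secondary development. neocrawler uses four storage spaces: driller_info_redis_db, url_info_redis_db, url_report_redis_db, and proxy_info_redis_db, all configurable in the instance's settings.json. The four spaces store different categories of data and their key names do not collide, so all four can point to a single redis/ssdb database. Each space grows at a different rate, however: with redis it is advisable to point each space at its own db and, where resources allow, to give each space its own redis instance. The four spaces are described in turn below: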

driller_info_redis_db

Stores crawl rules and URLs.

  • driller:{domain}:{alias}

Example: driller:163.com:newslist (curly braces denote variables, here and below). Hash type. Stores a crawl rule; the rules configured in the web interface are stored here.

  • urllib:driller:{domain}:{alias}

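Example: urllib:driller:163.com:newslist. List type. Stores the URL queue for one rule. When a spider discovers a URL that matches a crawl rule, it puts the URL into the corresponding queue; the scheduler takes URLs from these queues for scheduling, and the spiders crawl according to the scheduled queue. The whole process repeats in a loop.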
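The naming scheme of these two keys can be sketched with a few small helpers. The function names below are hypothetical and do not come from the neocrawler code base; only the key formats are taken from the descriptions above.

```javascript
// Hypothetical helpers that build and parse the key names described above.
function drillerKey(domain, alias) {
  // hash holding one crawl rule, e.g. driller:163.com:newslist
  return 'driller:' + domain + ':' + alias;
}

function urllibKey(domain, alias) {
  // list holding the URL queue of that rule
  return 'urllib:' + drillerKey(domain, alias);
}

function parseUrllibKey(key) {
  // Assumes neither the domain nor the alias contains a colon.
  var m = /^urllib:driller:([^:]+):([^:]+)$/.exec(key);
  return m ? { domain: m[1], alias: m[2] } : null;
}

console.log(urllibKey('163.com', 'newslist'));
```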

  • queue:scheduled:all

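The queue of URLs waiting to be crawled. List type. Many urllib queues (see the previous entry) exist at the same time. The scheduler derives how many URLs can currently be appended from the spiders' overall scheduling limit and the current length of queue:scheduled:all, then draws the corresponding number of URLs from each urllib queue according to the scheduling weights (priority, weight) set in the web configuration center and puts them into queue:scheduled:all. The spiders take URLs from queue:scheduled:all to crawl.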
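The idea of filling the scheduled queue by weight can be illustrated in memory. This is a sketch under assumptions, not the scheduler's actual algorithm: plain arrays stand in for the Redis lists, the split is strictly proportional to a single weight value per queue, and the function name `scheduleOnce` is invented for the example.

```javascript
// In-memory sketch: arrays stand in for the urllib lists and queue:scheduled:all.
// urllibs: { key: [url, ...] }, weights: { key: number }, limit: max scheduled length.
function scheduleOnce(urllibs, weights, scheduled, limit) {
  // Room left in the scheduled queue for this cycle.
  var room = Math.max(0, limit - scheduled.length);
  // Only queues that currently hold URLs take part in the split.
  var keys = Object.keys(urllibs).filter(function (k) {
    return urllibs[k].length > 0;
  });
  var totalWeight = keys.reduce(function (sum, k) {
    return sum + (weights[k] || 1); // assumed default weight of 1
  }, 0);
  keys.forEach(function (k) {
    // Each queue gets a share of the room proportional to its weight.
    var quota = Math.floor(room * (weights[k] || 1) / totalWeight);
    var batch = urllibs[k].splice(0, quota);
    Array.prototype.push.apply(scheduled, batch);
  });
  return scheduled;
}

var queues = { a: ['a1', 'a2', 'a3', 'a4'], b: ['b1', 'b2', 'b3', 'b4'] };
console.log(scheduleOnce(queues, { a: 3, b: 1 }, ['x', 'y'], 6));
```

Because each share is rounded down, a cycle may leave a little room unused; that room is simply available again in the next cycle.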

  • updated:driller:rule

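Records the version of the crawl rule configuration. Spiders and the scheduler pick up rule changes hot (refreshed in real time), but scanning and reloading every rule on each scheduling cycle (cycles are roughly a few seconds apart) is not practical, so a version record is used instead: after the web configuration center changes a crawl rule it updates the version information, the spiders check this key repeatedly, and they reload the crawl rules when they see that the version has changed.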
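The version check described above amounts to "reload only when the stored version differs from the one last seen". A minimal sketch follows, assuming the version can be read as a single comparable value; `RuleCache`, `readVersion`, and `loadRules` are hypothetical names, and the two callbacks stand in for whatever actually reads updated:driller:rule and the rule hashes.

```javascript
// Hypothetical rule cache: reloads rules only when the version value changes.
function RuleCache(readVersion, loadRules) {
  this.readVersion = readVersion; // returns the current version value
  this.loadRules = loadRules;     // returns the full rule set (expensive)
  this.version = null;            // version seen at the last reload
  this.rules = {};
}

// Returns true when a reload happened, false when the rules were still current.
RuleCache.prototype.refresh = function () {
  var current = this.readVersion();
  if (current === this.version) {
    return false;
  }
  this.rules = this.loadRules();
  this.version = current;
  return true;
};

// Usage with an in-memory stand-in for the store.
var store = { version: 'v1', loads: 0 };
var cache = new RuleCache(
  function () { return store.version; },
  function () { store.loads += 1; return { loadedTimes: store.loads }; }
);
cache.refresh(); // first call always loads
cache.refresh(); // unchanged version, no reload
```

Calling `refresh()` once per scheduling cycle keeps the cost of the common case to a single key read.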

url_info_redis_db

This space stores URL information; the longer the crawl runs, the more data accumulates here.

  • {url-md5-lowercase}

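Example: 9108d6a10bd476158144186138fe0ba8. Hash type. Records the details of one URL: the page on which it was found, its current state, the crawler system's operation trail for it (discovered, scheduled, crawled, stored / failed, and so on), and the time of the last operation. The scheduler uses these records to decide whether to schedule a URL for crawling when it is discovered a second time.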
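A key of this form can be produced with Node's built-in crypto module. This is a sketch: the function name `urlKey` is hypothetical, and whether neocrawler normalizes the URL in any way before hashing is not stated here, so the example hashes the string exactly as given.

```javascript
var crypto = require('crypto');

// Hypothetical helper: lowercase hex MD5 of the URL string, used as the key name.
// Assumption: the URL is hashed as-is, without any normalization.
function urlKey(url) {
  return crypto.createHash('md5').update(url, 'utf8').digest('hex');
}

console.log(urlKey('http://example.com/news/1.html'));
```

The `hex` digest from Node is already lowercase, which matches the {url-md5-lowercase} key format.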

url_report_redis_db

This space stores the spiders' crawl reports.

  • fail:urllib:driller:{domain}:{alias}

Example: fail:urllib:driller:163.com:newslist. zset type. Records URLs that failed to be crawled.

  • stuck:urllib:driller:{domain}:{alias}

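Example: stuck:urllib:driller:163.com:newslist. zset type. Records URLs whose storage (hbase) failed.

URLs that failed to be crawled or stored can be added back to the crawl queue with tools/queue-helper.js so that they are crawled again.

Note: the spider already retries crawls that fail because of network factors, and the number of retries can be configured in settings.json. The crawl / storage failures mentioned above are URLs that still fail after the spider's repeated attempts, usually because of an incorrect crawl rule or an hbase fault.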

  • count:{date}

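Example: count:20150203. Hash type. Incremental statistics on crawl activity. The customization functions in spider_extend.js under the instance folder contain incremental-statistics statements that are commented out by default; uncomment them to enable the statistics. The results can then be seen under Crawling Daily Report in the web configuration center.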
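The daily key and its per-field counters can be sketched in memory. Everything named here is an assumption for illustration: `countKey` and `bump` are hypothetical helpers, the field name `crawl` is made up, the date is taken in local time, and a plain object stands in for the Redis hash.

```javascript
// Hypothetical helper: builds count:{date} with the date as YYYYMMDD (local time).
function countKey(date) {
  var y = date.getFullYear();
  var m = ('0' + (date.getMonth() + 1)).slice(-2);
  var d = ('0' + date.getDate()).slice(-2);
  return 'count:' + y + m + d;
}

// Hypothetical in-memory stand-in for incrementing one field of the daily hash.
function bump(store, date, field) {
  var key = countKey(date);
  store[key] = store[key] || {};
  store[key][field] = (store[key][field] || 0) + 1;
  return store[key][field];
}

var stats = {};
bump(stats, new Date(2015, 1, 3), 'crawl'); // month index 1 is February
console.log(stats);
```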

proxy_info_redis_db

This space stores data related to proxy IPs.

  • proxy:public:available:3s

List type. The proxy IPs that are currently available.


Source: https://geekspider.org/