阿布云

What you need is more than just a good proxy.

A Crawler Framework for Custom Development

Posted by 阿布云

 

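webmagic is an open-source vertical crawler framework for Java. Its goal is to simplify crawler development so that developers can focus on their own logic. The core of webmagic is very small, yet it covers the whole crawling workflow, which also makes it good material for learning how crawlers are built.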

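Main features of webmagic:

  • Fully modular design with strong extensibility.
  • A simple core that still covers the entire crawling workflow; flexible, powerful, and a good starting point for learning crawler development.
  • A rich API for extracting data from pages.
  • No configuration files; a crawler can instead be defined with a POJO plus annotations.
  • Multithreading support.
  • Distributed crawling support.
  • Support for crawling pages rendered dynamically by JavaScript.
  • No framework dependencies, so it can be embedded flexibly into an existing project.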

The architecture and design of webmagic draw on the following two projects, and thanks go to their authors:

Python crawler scrapy https://github.com/scrapy/scrapy

Java crawler Spiderman http://git.oschina.net/l-weiwei/spiderman

webmagic on GitHub: https://github.com/code4craft/webmagic

Quick start

Using maven

webmagic manages its dependencies with maven. Add the following dependencies to your project to use it:

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.6.1</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.6.1</version>
    </dependency>

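WebMagic uses slf4j-log4j12 as its slf4j implementation. If you supply your own slf4j binding, exclude this dependency by placing the following inside the corresponding <dependency> element: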

    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>

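Project structure

webmagic consists of two main packages:

  • webmagic-core

    The core of webmagic. It contains only the basic crawler modules and basic extractors. The goal of webmagic-core is to be a textbook implementation of a web crawler.

  • webmagic-extension

    The extension module of webmagic. It provides tools that make writing crawlers more convenient, including annotation-based crawler definitions, JSON support, and distributed crawling support.

webmagic also provides two optional extension packages. Because both depend on fairly heavyweight tools, they are kept out of the main packages, and you need to download the source and build them yourself:

  • webmagic-saxon

    The module that integrates webmagic with Saxon. Saxon is an XPath and XSLT processor; webmagic relies on it to support XPath 2.0 syntax.

  • webmagic-selenium

    The module that integrates webmagic with Selenium. Selenium drives a browser to render pages; webmagic relies on it to crawl dynamic pages.

In your project, depend on whichever packages you need.

Without maven

The project's lib directory contains all the required jar files; import them directly in your IDE.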

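Your first crawler

Customizing a PageProcessor

PageProcessor is part of webmagic-core. Implementing your own PageProcessor is all it takes to define your crawling logic. The following code crawls an oschina (osc) blog: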

    import java.util.List;

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class OschinaBlogPageProcesser implements PageProcessor {

        // Site-level settings for this crawl
        private Site site = Site.me().setDomain("my.oschina.net");

        @Override
        public void process(Page page) {
            // Collect links to individual blog posts and queue them for crawling
            List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
            page.addTargetRequests(links);
            // Extract the fields to save
            page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
            page.putField("content", page.getHtml().$("div.content").toString());
            page.putField("tags", page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // Start from the blog index page and print results to the console
            Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
                 .addPipeline(new ConsolePipeline()).run();
        }
    }

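Here page.addTargetRequests() adds URLs to be crawled, and page.putField() stores the extracted results. page.getHtml().xpath() extracts content according to a rule, and extraction calls can be chained. At the end of a chain, toString() converts the result to a single String, while all() converts it to a list of Strings.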
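To see which links the regex in process() is meant to keep, here is a minimal sketch that uses only java.util.regex. It is an approximation of the filtering idea, not webmagic's selector code; the class name LinkFilterSketch, the method filterPostLinks, and the sample URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkFilterSketch {

    // Same pattern string as in the PageProcessor example: post URLs ending in a numeric id.
    private static final Pattern POST_URL =
            Pattern.compile("http://my\\.oschina\\.net/flashsword/blog/\\d+");

    /** Returns the matched part of every link that contains a blog-post URL. */
    static List<String> filterPostLinks(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            Matcher m = POST_URL.matcher(link);
            if (m.find()) {
                result.add(m.group()); // keep only the text that matched the pattern
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical links, as they might be found on the blog index page.
        List<String> links = Arrays.asList(
                "http://my.oschina.net/flashsword/blog/145796",
                "http://my.oschina.net/flashsword/blog",
                "http://my.oschina.net/flashsword/home");
        for (String link : filterPostLinks(links)) {
            System.out.println(link);
        }
    }
}
```

Only the first sample link ends in a numeric post id, so it is the only one that would be queued for crawling under this assumption.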

Spider is the entry class of the crawler. Pipeline is the interface for outputting and persisting results; here ConsolePipeline prints the results to the console.

Run the main method and the crawl results appear in the console. webmagic waits 3 seconds between requests by default, so please be patient.
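The flow described above (queue a start URL, process each page, add newly found links, hand results to an output step) can be pictured with a toy loop. This is only a sketch over an in-memory "site" with no networking, threading, or crawl delay; ToyCrawlLoop, FakePage, and crawl are hypothetical names, and webmagic's real components are more involved.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class ToyCrawlLoop {

    /** Stand-in for a downloaded page: a title plus its outgoing links. */
    static final class FakePage {
        final String title;
        final List<String> links;

        FakePage(String title, String... links) {
            this.title = title;
            this.links = Arrays.asList(links);
        }
    }

    /** Crawls the in-memory site from startUrl; returns url -> title in visit order. */
    static Map<String, String> crawl(Map<String, FakePage> site, String startUrl) {
        Queue<String> queue = new ArrayDeque<>();      // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();            // avoids fetching a URL twice
        Map<String, String> results = new LinkedHashMap<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            FakePage page = site.get(url);             // plays the role of the download step
            if (page == null) {
                continue;                              // unknown URL: nothing to process
            }
            results.put(url, page.title);              // plays the role of putField()
            for (String link : page.links) {           // plays the role of addTargetRequests()
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, FakePage> site = new HashMap<>();
        site.put("/blog", new FakePage("index", "/blog/1", "/blog/2"));
        site.put("/blog/1", new FakePage("first post", "/blog", "/blog/2"));
        site.put("/blog/2", new FakePage("second post", "/blog/1"));
        // Output step: print each result, much as a console pipeline would.
        crawl(site, "/blog").forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```

Keeping the processing step separate from the output step is what lets the same extraction logic feed a console, a file, or a database without changes.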

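Using annotations

webmagic-extension lets you write a crawler with annotations: adding annotations to a POJO is enough to define one. The following code again crawls the oschina blog and does the same job as OschinaBlogPageProcesser: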

    import java.util.List;

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.model.ConsolePageModelPipeline;
    import us.codecraft.webmagic.model.OOSpider;
    import us.codecraft.webmagic.model.annotation.ExtractBy;
    import us.codecraft.webmagic.model.annotation.TargetUrl;

    // Pages whose URL matches this pattern are extracted into OschinaBlog objects
    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        // XPath is the default extraction type
        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
        private String content;

        // multi = true collects every match into a list
        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(
                Site.me(),
                new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
        }
    }

This example defines a Model class whose fields title, content, and tags are the properties to extract. The class can be reused in a Pipeline.

Detailed documentation

http://webmagic.io/docs/

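If you crawl too frequently, the server responds with HTTP 429 (Too Many Requests). At that point you need to switch to a different proxy IP. 阿布云 proxy is recommended; it offers high-anonymity proxies and crawler proxies.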
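As a rough illustration of switching proxies when a 429 comes back, the sketch below rotates through a fixed list using only JDK classes. The class ProxyRotationSketch, its methods, and the proxy host names are assumptions made for this example; it does not show how to plug a proxy into webmagic or into any particular proxy service, and no request is actually sent.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Arrays;
import java.util.List;

public class ProxyRotationSketch {

    private final List<Proxy> proxies;
    private int index = 0;

    ProxyRotationSketch(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("need at least one proxy");
        }
        this.proxies = proxies;
    }

    /** Builds an HTTP proxy entry without resolving the host name. */
    static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /**
     * Takes the HTTP status of the last response and returns the proxy to use
     * for the next request: the same one normally, the next in the list after a 429.
     */
    Proxy next(int lastStatus) {
        if (lastStatus == 429) {
            index = (index + 1) % proxies.size(); // wrap around at the end of the list
        }
        return proxies.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical proxy endpoints; replace with real ones from your provider.
        ProxyRotationSketch rotation = new ProxyRotationSketch(Arrays.asList(
                http("proxy-a.example.invalid", 8080),
                http("proxy-b.example.invalid", 8080)));
        int[] statuses = {200, 429, 200, 429};
        for (int status : statuses) {
            System.out.println(status + " -> " + rotation.next(status));
        }
    }
}
```

In a real crawler it usually also helps to slow down after a 429 rather than only changing the IP, since the limit may apply to the account or session as well.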