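webmagic is an open-source vertical crawler framework for Java. It aims to simplify crawler development so that developers can focus on their own logic. The core of webmagic is very simple, yet it covers the entire crawling workflow, which also makes it good material for learning how crawlers are built.
Main features of webmagic:
The architecture and design of webmagic draw on the following two projects, and thanks go to their authors:
scrapy, a Python crawler: https://github.com/scrapy/scrapy
Spiderman, a Java crawler: http://git.oschina.net/l-weiwei/spiderman
webmagic on GitHub: https://github.com/code4craft/webmagic
webmagic uses Maven to manage dependencies. Add the following dependencies to your project to use webmagic: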
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.6.1</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.6.1</version>
    </dependency>
WebMagic uses slf4j-log4j12 as its slf4j implementation. If you supply your own slf4j implementation, exclude this dependency from your project:
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
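webmagic consists of two main packages:
webmagic-core: the core of webmagic, containing only the basic crawler modules and the basic extractors. The goal of webmagic-core is to be a textbook implementation of a web crawler.
webmagic-extension: the extension module of webmagic, which provides tools that make crawlers more convenient to write, including annotation-based crawler definitions, JSON support, and distributed crawling.
webmagic also ships two optional extension packages. Both depend on fairly heavyweight tools, so they are kept out of the main packages, and you need to download the source and compile them yourself:
The module that integrates webmagic with Saxon. Saxon is an XPath and XSLT processing tool; webmagic relies on it to support XPath 2.0 syntax.
The module that integrates webmagic with Selenium. Selenium is a tool that renders pages by simulating a browser; webmagic relies on it to crawl dynamic pages.
In your project, depend on whichever of these packages you need.
The lib directory of the project contains all of the dependency jars, which you can import directly in your IDE.
PageProcessor is part of webmagic-core. Writing your own PageProcessor is all it takes to implement your crawler logic. The following code crawls the OSChina (osc) blog: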
    import java.util.List;

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class OschinaBlogPageProcesser implements PageProcessor {
        private Site site = Site.me().setDomain("my.oschina.net");

        @Override
        public void process(Page page) {
            // Queue every blog post link found on the current page.
            List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
            page.addTargetRequests(links);
            // Store the extracted fields as the result for this page.
            page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
            page.putField("content", page.getHtml().$("div.content").toString());
            page.putField("tags", page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // Start from the blog index and print the results to the console.
            Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
                    .addPipeline(new ConsolePipeline()).run();
        }
    }
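Here, page.addTargetRequests() adds the URLs to crawl next, and page.putField() stores the extracted results. page.getHtml().xpath() extracts content according to a rule, and extraction calls can be chained. At the end of a chain, toString() converts the result to a single String, while all() converts it to a list of Strings.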
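To make the link-selection step more concrete, the following is a minimal sketch that applies the same regular expression with plain java.util.regex instead of webmagic. The class LinkFilterSketch, its methods all() and first(), and the sample URLs are hypothetical names invented for this illustration; they only approximate what the chained regex(...).all() and toString() calls do and are not webmagic's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK illustration; none of these names belong to webmagic.
class LinkFilterSketch {

    // The same pattern used in the PageProcessor example.
    static final Pattern POST_URL =
            Pattern.compile("http://my\\.oschina\\.net/flashsword/blog/\\d+");

    // Rough analogue of all(): the matched part of every link that matches, in order.
    static List<String> all(List<String> links) {
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            Matcher m = POST_URL.matcher(link);
            if (m.find()) {
                matched.add(m.group());
            }
        }
        return matched;
    }

    // Rough analogue of toString(): a single value, or null when nothing matched.
    static String first(List<String> links) {
        List<String> matched = all(links);
        return matched.isEmpty() ? null : matched.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical links, as they might appear on the blog index page.
        List<String> links = Arrays.asList(
                "http://my.oschina.net/flashsword/blog/145796",
                "http://my.oschina.net/flashsword/home",
                "http://my.oschina.net/flashsword/blog/180623");
        System.out.println(all(links));   // the two blog post URLs
        System.out.println(first(links)); // only the first blog post URL
    }
}
```

Only the links that look like individual posts survive the filter, which is why the crawler in the example follows posts but not other pages of the site.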
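Spider is the entry class of the crawler. Pipeline is the interface for outputting and persisting results; here, ConsolePipeline prints the results to the console.
Run this main method and the crawl results appear in the console. webmagic waits 3 seconds between requests by default, so please be patient.
webmagic-extension includes a way to write crawlers with annotations: adding annotations to a POJO is enough to build a crawler. The following code again crawls the oschina blog and does the same job as OschinaBlogPageProcesser: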
    import java.util.List;

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.model.ConsolePageModelPipeline;
    import us.codecraft.webmagic.model.OOSpider;
    import us.codecraft.webmagic.model.annotation.ExtractBy;
    import us.codecraft.webmagic.model.annotation.TargetUrl;

    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {
        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
        private String content;

        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            // Crawl from the blog index and print each extracted model to the console.
            OOSpider.create(
                    Site.me(),
                    new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
        }
    }
This example defines a Model class whose fields 'title', 'content', and 'tags' are the attributes to extract. The class can be reused in a Pipeline.
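For readers curious how annotations on a POJO can drive extraction at all, here is a minimal plain-JDK sketch of the general idea: a runtime annotation carries an extraction rule, and reflection fills each annotated field. Everything in it (AnnotationModelSketch, the @ExtractByRegex annotation, BlogModel, and the regex-based rules) is hypothetical and invented for this illustration. It is not how webmagic implements @ExtractBy, which works with XPath and CSS selectors rather than these regexes.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK illustration; these names are hypothetical and are not webmagic's API.
class AnnotationModelSketch {

    // Hypothetical extraction rule: a regex whose first capture group
    // is the value to store in the annotated field.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractByRegex {
        String value();
    }

    // A model class in the spirit of OschinaBlog: it declares only annotated fields.
    static class BlogModel {
        @ExtractByRegex("<title>(.*?)</title>")
        String title;

        @ExtractByRegex("<div class=\"BlogContent\">(.*?)</div>")
        String content;
    }

    // Creates a model instance and fills every annotated field from the HTML text.
    static <T> T extract(String html, Class<T> modelClass) {
        try {
            T model = modelClass.getDeclaredConstructor().newInstance();
            for (Field field : modelClass.getDeclaredFields()) {
                ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                if (rule == null) {
                    continue; // fields without a rule are left untouched
                }
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not populate " + modelClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><div class=\"BlogContent\">First post</div></body></html>";
        BlogModel blog = extract(html, BlogModel.class);
        System.out.println(blog.title + " / " + blog.content); // prints "Hello / First post"
    }
}
```

Because the rules live on the model class itself, the same class can describe what to extract and serve as the typed result that later processing stages receive.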