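5 Nutch Code Reading
1 Nutch workflow
1) Build the initial set of URLs
2) Inject the URL set into the crawldb database --- inject
3) Create a fetch list from the crawldb database --- generate
4) Run the fetch and retrieve the page data --- fetch
5) Update the database, storing the fetched page information in it --- updatedb
6) Repeat steps 3 to 5 until the preset crawl depth is reached.
--- This loop is called the "generate/fetch/update" cycle
7) Update the linkdb database from the contents of the segments --- LinkDb
8) Build the index --- Indexer
9) The user issues a query through the user interface
10) The user query is translated into a Lucene query
11) The results are returned
Steps 1 to 6 make up the crawler part, steps 7 and 8 the indexing part, and steps 9 to 11 the search part.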
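The generate/fetch/update cycle of steps 2 to 6 can be illustrated with a small in-memory simulation. This is only a toy sketch, not Nutch's implementation: the WEB graph, the function names and the "fetched"/"unfetched" strings are all hypothetical stand-ins for the crawldb, the segments and the real Nutch jobs.

```python
# Hypothetical link graph standing in for the web: URL -> outlinks.
WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["f"],
    "f": [],
}


def inject(crawldb, seeds):
    """Step 2: add the seed URLs to the crawldb as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Step 3: build a fetch list of at most top_n unfetched URLs."""
    return [url for url, status in crawldb.items() if status == "unfetched"][:top_n]


def fetch(fetch_list):
    """Step 4: 'download' each page; here a segment is just URL -> outlinks."""
    return {url: WEB.get(url, []) for url in fetch_list}


def updatedb(crawldb, segment):
    """Step 5: mark fetched pages and add newly discovered outlinks."""
    for url, outlinks in segment.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, depth, top_n):
    """Steps 2 to 6: inject once, then repeat generate/fetch/update."""
    crawldb, segments = {}, []
    inject(crawldb, seeds)
    for _ in range(depth):
        fetch_list = generate(crawldb, top_n)
        if not fetch_list:
            break  # nothing left to fetch
        segment = fetch(fetch_list)
        updatedb(crawldb, segment)
        segments.append(segment)
    return crawldb, segments


crawldb, segments = crawl(["a"], depth=2, top_n=20)
```

With one seed and a depth of 2, the first round fetches only the seed and the second round fetches the pages it links to, so each round produces one segment and the crawldb ends up holding both fetched and still unfetched URLs.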
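2 Basic components of a Nutch data set
2.1 crawldb:
The crawl database, which stores the URLs to be crawled
2.2 segments:
A batch of fetched URLs is treated as one unit, and each segment is one such unit.
A segment contains the following subdirectories:
2.2.1 crawl_generate:
Contains the list of URLs to be fetched
2.2.2 crawl_fetch:
Contains the fetch status of each page
2.2.3 content:
Contains the content of each fetched page
2.2.4 parse_text:
Contains the parsed text of each fetched page
2.2.5 parse_data:
Contains the outlinks and metadata of each page
2.2.6 crawl_parse:
Contains the outlink URLs, used to update the crawldb database
2.3 linkdb:
The link database, which stores the links of each URL, including the source address and the link address
2.4 indexes:
The set of indexes, built in Lucene format
2.5 index: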
?
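As a quick way to keep the six segment subdirectories of 2.2 straight, the sketch below lists them in a dictionary and reports which ones are absent from a segment directory. The helper name missing_subdirs and the temporary directory used for the demonstration are assumptions made for this example; only the subdirectory names come from the list above.

```python
import os
import tempfile

# Subdirectory names and descriptions as listed in section 2.2.
SEGMENT_SUBDIRS = {
    "crawl_generate": "list of URLs to be fetched",
    "crawl_fetch": "fetch status of each page",
    "content": "content of each fetched page",
    "parse_text": "parsed text of each fetched page",
    "parse_data": "outlinks and metadata of each page",
    "crawl_parse": "outlink URLs used to update the crawldb",
}


def missing_subdirs(segment_path):
    """Return the expected subdirectories that do not exist under segment_path."""
    return [
        name
        for name in SEGMENT_SUBDIRS
        if not os.path.isdir(os.path.join(segment_path, name))
    ]


# Demonstration on a throwaway directory (not a real Nutch segment):
# only the first three subdirectories are created.
with tempfile.TemporaryDirectory() as root:
    segment = os.path.join(root, "20130424172116")
    for name in ("crawl_generate", "crawl_fetch", "content"):
        os.makedirs(os.path.join(segment, name))
    demo_missing = missing_subdirs(segment)
```

In the demonstration the three parse-related subdirectories are reported as missing, which is the kind of result one might expect if a later step had not yet run on that segment.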
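3 Nutch analysis methods and tools
All of the analysis below is based on the results of a crawl run with the arguments urls -dir 1crawled -depth 2 -threads 4 -topN 20
3.1 Crawldb analysis
3.1.1 View the summary
heyi@heyi-PC /cygdrive/d/nutch-1.2_new
$ bin/nutch readdb 1crawled/crawldb/ -stats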
CrawlDb statistics start: 1crawled/crawldb/
Statistics for CrawlDb: 1crawled/crawldb/
TOTAL urls: 24
retry 0: 24
min score: 0.026
avg score: 0.118875
max score: 1.333
status 1 (db_unfetched): 3
status 2 (db_fetched): 21
CrawlDb statistics: done
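The numbers in this summary can be cross-checked with a few lines of Python. The sketch below copies the figures from the output above into a string, parses them, and checks that the per-status counts add up to the total. The function name parse_stats is an assumption for this example and not part of Nutch.

```python
# Figures copied from the -stats output above (spacing restored).
STATS_TEXT = """\
TOTAL urls: 24
retry 0: 24
min score: 0.026
avg score: 0.118875
max score: 1.333
status 1 (db_unfetched): 3
status 2 (db_fetched): 21
"""


def parse_stats(text):
    """Parse 'label: number' lines into a dict; skip lines without a number."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep:
            continue
        value = value.strip()
        try:
            number = float(value) if "." in value else int(value)
        except ValueError:
            continue
        stats[key.strip()] = number
    return stats


stats = parse_stats(STATS_TEXT)

# The per-status counts should account for every URL in the crawldb.
status_total = sum(v for k, v in stats.items() if k.startswith("status "))

# Share of known URLs that have already been fetched.
fetched_ratio = stats["status 2 (db_fetched)"] / stats["TOTAL urls"]
```

Here 3 unfetched plus 21 fetched URLs give the reported total of 24, so 87.5% of the known URLs had been fetched after the two rounds.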
3.1.2 Dump the contents and take a look
$ bin/nutch readdb 1crawled/crawldb/ -dump crawldb_dump
CrawlDb dump: starting
CrawlDb db: 1crawled/crawldb/
CrawlDb dump: done
$ more *
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:22 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.3333334
Signature: 3dbbb785e082201813d53ec4ba7be28a
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:00 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 3c453e109c74e605284d61e5e2fc70ad
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:13 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 28fb7fbac69a66a0bceb2674f377ef90
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:26 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 70995f0eb00bc8e153765fe5bae4dc19
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:31 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 378627e7517e264ea27d7d0e74c950b3
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:42 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 02438da3a457b38a80229e64d8c27909
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:04 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 6fc20c5aafa485a65e7a74a4a19b01f8
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:21 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 8a0a290038d07706c8bfd2bc63f488dd
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:39 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.048245616
Signature: 4b3f7e8effd5740de2f5ee1ff8ffdd4f
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:23 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.04385965
Signature: 2075a99894392384877e38f93fcd3d5d
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:48 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 396a1d8b06f6264897e98f55271887f4
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:09 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: c01ea2329fafce6bbf60b0d2473a62f9
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:19 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 8cd4147b713a809079fce1b89743a5e9
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Apr 24 17:21:25 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.02631579
Signature: null
Metadata:

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:11 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.04385965
Signature: 57791dc2e777119b78db53eef98453ee
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:50 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.04385965
Signature: e3bc420cd8218add38992391adda0cfc
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:58 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.04385965
Signature: 3c3a64ae8e48162ebbb3aa06488828de
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.04385965
Signature: 48db854d816c5e843d7b3e21b028727d
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:40 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: b0df676f1a3756c0f08d26f0b1af2f38
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Apr 24 17:21:25 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.02631579
Signature: null
Metadata:

Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Apr 24 17:21:25 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.02631579
Signature: null
Metadata:

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:03 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: 41869be3ce4600f2bf698ff167433413
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:22:06 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0877193
Signature: afefd0c30c512eb38fc31f0fdd2b226a
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:57 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.048245616
Signature: 406fdfeb8e14100d3b286347b0a1918d
Metadata: _pst_: success(1), lastModified=0
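The dump records above can be tallied with a short script. The sketch below uses two of the records (one fetched, one unfetched, with spacing restored) as sample input. The function name parse_records is an assumption for this example, and treating each Version line as the start of a record is a simplification that fits this listing. The last lines also check one observation that the data suggests but this document does not state: for a fetched page, the listed fetch time looks like the time of the crawl plus the 30-day retry interval, so it would be the next scheduled fetch rather than the last one.

```python
from collections import Counter
from datetime import datetime, timedelta

# Two records copied from the dump above (spacing restored).
DUMP_TEXT = """\
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 24 17:21:22 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.3333334
Signature: 3dbbb785e082201813d53ec4ba7be28a
Metadata: _pst_: success(1), lastModified=0

Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Apr 24 17:21:25 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.02631579
Signature: null
Metadata:
"""


def parse_records(text):
    """Split the dump into one dict per record, keyed by field name."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line
        key, value = key.strip(), value.strip()
        if key == "Version":  # assumption: a Version line opens each record
            records.append({})
        if records:
            records[-1][key] = value
    return records


records = parse_records(DUMP_TEXT)
status_counts = Counter(r["Status"] for r in records)

# 2592000 seconds expressed in days.
interval_seconds = int(records[0]["Retry interval"].split()[0])
retry_days = interval_seconds / 86400

# Assumption: the first page was fetched at 2013-04-24 17:21:22, the fetcher
# start time of the first segment in the listing of section 3.2.1.
crawl_time = datetime(2013, 4, 24, 17, 21, 22)
next_fetch = crawl_time + timedelta(seconds=interval_seconds)
listed_fetch = datetime.strptime(
    records[0]["Fetch time"].replace(" CST", ""), "%a %b %d %H:%M:%S %Y"
)
```

On this sample, next_fetch and listed_fetch coincide, which is consistent with the reading that the fetch time of a db_fetched entry is when the page becomes due again, while the db_unfetched entries still carry a time from the day of the crawl.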
3.2 Segments analysis
3.2.1 View the summary
$ bin/nutch readseg -list -dir 1crawled/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20130424172116  1          2013-04-24T17:21:22  2013-04-24T17:21:22  1        1
20130424172128  20         2013-04-24T17:21:31  2013-04-24T17:22:26  20       20
3.2.2 Dump the contents and take a look