Nutch 1.2 Source Code Analysis

      .equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-depth".equals(args[i])) {
        depth = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-solr".equals(args[i])) {
        indexerName = "solr";
        solrUrl = StringUtils.lowerCase(args[i + 1]);
        i++;
      } else if (args[i] != null) {
        rootUrlDir = new Path(args[i]);
      }
    }
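
To see how these options combine, here is a minimal, self-contained sketch of the same loop. It is not the Nutch class: `CrawlArgsSketch`, its fields, and the default values are placeholders invented for illustration, plain strings stand in for Hadoop's `Path`, and `String.toLowerCase` stands in for `StringUtils.lowerCase`. The `-dir` option name is an assumption inferred from the `dir = ...` assignment, since the start of the fragment is cut off.

```java
import java.util.Locale;

final class CrawlArgsSketch {
    // Placeholder defaults for illustration only; not taken from the source.
    String rootUrlDir = null;
    String dir = "crawl";
    int threads = 10;
    int depth = 5;
    long topN = Long.MAX_VALUE;
    String indexerName = "lucene";
    String solrUrl = null;

    static CrawlArgsSketch parse(String[] args) {
        CrawlArgsSketch c = new CrawlArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i])) {              // assumed option name
                c.dir = args[i + 1];
                i++;                                   // skip the option's value
            } else if ("-threads".equals(args[i])) {
                c.threads = Integer.parseInt(args[i + 1]);
                i++;
            } else if ("-depth".equals(args[i])) {
                c.depth = Integer.parseInt(args[i + 1]);
                i++;
            } else if ("-topN".equals(args[i])) {
                c.topN = Integer.parseInt(args[i + 1]);
                i++;
            } else if ("-solr".equals(args[i])) {
                c.indexerName = "solr";
                c.solrUrl = args[i + 1].toLowerCase(Locale.ROOT);
                i++;
            } else if (args[i] != null) {
                c.rootUrlDir = args[i];                // anything else is the seed directory
            }
        }
        return c;
    }

    public static void main(String[] args) {
        CrawlArgsSketch c = parse(new String[] {
            "urls", "-dir", "crawl-out", "-depth", "3", "-topN", "50"});
        System.out.println(c.rootUrlDir + " " + c.dir + " " + c.depth + " " + c.topN);
    }
}
```

Two properties of the loop are worth noticing. The final branch catches every unrecognized argument, so a second positional argument silently overwrites `rootUrlDir`. An option given as the last argument without a value makes `args[i + 1]` throw `ArrayIndexOutOfBoundsException`.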

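Now for the main part:

1. First, Injector's inject function is called to inject the URLs from every file under the rootUrlDir directory into the crawlDb database.

2. Then, up to the specified crawl depth, the following three steps are repeated:

• Call Generator's generate function to produce the fetchlist of URLs and place it in one or more segments.

• Call Fetcher's fetch function to fetch the segment designated by segs[0] and parse it. Parsing can either be integrated into the fetch step or run as a separate step afterwards.

Note that only the first segment is fetched here. The array may contain several segments, and the others are never fetched.

• Call CrawlDb's update function to merge the fetch and parse data from the segments into the database (crawlDb).

Here the whole segs array is passed in, which I think may be a problem, since only segs[0] has actually been fetched.

3. Build and merge the index.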

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {  // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads,
          org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);  // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true);  // update crawldb
    }

    if (i > 0) {
      linkDbTool.invert(linkDb, segments, true, true, false);  // invert links

      // index, dedup & merge
      FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));
      if (isSolrIndex) {
        SolrIndexer indexer = new SolrIndexer(conf);
        indexer.indexSolr(solrUrl, crawlDb, linkDb,
            Arrays.asList(HadoopFSUtil.getPaths(fstats)));
      }
      else {

        DeleteDuplicates dedup = new DeleteDuplicates(conf);
        if (indexes != null) {
          // Delete old indexes
          if (fs.exists(indexes)) {
            LOG.info("Deleting old indexes: " + indexes);
            fs.delete(indexes, true);
          }
          // Delete old index
          if (fs.exists(index)) {
            LOG.info("Deleting old merged index: " + index);
            fs.delete(index, true);
          }
        }

        Indexer indexer = new Indexer(conf);
        indexer.index(indexes, crawlDb, linkDb,
            Arrays.asList(HadoopFSUtil.getPaths(fstats)));

        IndexMerger merger = new IndexMerger(conf);
        if (indexes != null) {
          dedup.dedup(new Path[] { indexes });
          fstats = fs.listStatus(indexes, HadoopFSUtil.getPassDirectoriesFilter(fs));
          merger.merge(HadoopFSUtil.getPaths(fstats), index, tmpDir);
        }
      }

    } else {
      LOG.warn("No URLs to fetch - check your seed list and URL filters.");
    }
    if (LOG.isInfoEnabled()) { LOG.info("crawl finished: " + dir); }
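
The concern about `segs[0]` raised in the step list can be made concrete. The sketch below is not Nutch code: `SegmentLoopSketch` and its methods are hypothetical, segments are plain strings, and the fetch and update steps only record which segments they were given. It contrasts the loop as written (fetch the first segment, update with the whole array) with one possible variant that fetches every generated segment before the update. Whether the difference matters in practice depends on how many segments `generate` returns for these arguments, which the excerpt does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SegmentLoopSketch {
    final List<String> fetched = new ArrayList<>();
    final List<String> updated = new ArrayList<>();

    // Stand-ins for the fetch and update steps: they only record their inputs.
    void fetch(String segment) { fetched.add(segment); }
    void update(String[] segments) { updated.addAll(Arrays.asList(segments)); }

    // Mirrors the listing: only segs[0] is fetched, but every segment is passed to update.
    void roundAsWritten(String[] segs) {
        fetch(segs[0]);
        update(segs);
    }

    // One possible variant (an assumption, not the author's fix): fetch each segment first.
    void roundFetchingAll(String[] segs) {
        for (String seg : segs) {
            fetch(seg);
        }
        update(segs);
    }

    public static void main(String[] args) {
        String[] segs = {"segment-a", "segment-b"};

        SegmentLoopSketch asWritten = new SegmentLoopSketch();
        asWritten.roundAsWritten(segs);
        System.out.println("as written: fetched=" + asWritten.fetched
            + " updated=" + asWritten.updated);

        SegmentLoopSketch all = new SegmentLoopSketch();
        all.roundFetchingAll(segs);
        System.out.println("fetch all:  fetched=" + all.fetched
            + " updated=" + all.updated);
    }
}
```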

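A Brief Analysis of the Nutch Source Code (2): The Inject Workflow in Crawl

The entry point of the inject operation is in org.apache.nutch.crawl.Injector. The main function calls run, which in turn calls inject:

1. Create a folder, tempDir, whose name ends in a random number.

2. Run a Hadoop MapReduce job (sortJob) that reads every URL under the urls directory, collects them in an OutputCollector<Text, CrawlDatum> object, and serializes that output into the tempDir folder.

3. Run a second Hadoop MapReduce job (mergeJob) that merges the new data in tempDir with the existing data in crawlDb. If a URL appears in both, the existing entry is kept. The result is written to newCrawlDb, a folder under crawlDb named with a random number (a different name from the one in step 1).

4. Rename the "current" folder under crawlDb to "old" (deleting any existing "old" folder first), rename newCrawlDb to "current", and delete tempDir.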

    public void inject(Path crawlDb, Path urlDir) throws IOException {
      SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
      long start = System.currentTimeMillis();
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: starting at " + sdf.format(start));
        LOG.info("Injector: crawlDb: " + crawlDb);
        LOG.info("Injector: urlDir: " + urlDir);
      }

      // By default this creates a directory such as inject-temp-853856658
      // under the current folder; the digits are a random number.
      Path tempDir =
        new Path(getConf().get("mapred.temp.dir", ".") +
                 "/inject-temp-" +
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      // map text input file to a <url,CrawlDatum> file
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: Converting injected urls to crawl db entries.");
      }
      JobConf sortJob = new NutchJob(getConf());
      sortJob.setJobName("inject " + urlDir);
      // Use the urls directory passed on the command line as sortJob's input;
      // its contents are read later.
      FileInputFormat.addInputPath(sortJob, urlDir);
      // InjectMapper turns the contents of every file under the urls directory
      // into OutputCollector<Text, CrawlDatum> entries. Text holds the URL and
      // CrawlDatum holds the URL's state information, described in detail later.
      // There are two ways to set the mapper: setMapperClass or setMapRunnerClass.
      // The latter is used in Fetcher; its advantage is that it can run with threads.
      sortJob.setMapperClass(InjectMapper.class);

      // Store the output under the tempDir directory.
      FileOutputFormat.setOutputPath(sortJob, tempDir);
      // Store the files in SequenceFile format.
      sortJob.setOutputFormat(SequenceFileOutputFormat.class);
      // Set the key and value classes to match InjectMapper's map output,
      // OutputCollector<Text, CrawlDatum>. Every Hadoop job must follow this rule.
      sortJob.setOutputKeyClass(Text.class);
      sortJob.setOutputValueClass(CrawlDatum.class);
      sortJob.setLong("injector.current.time", System.currentTimeMillis());
      // Run sortJob through Hadoop's MapReduce implementation.
      JobClient.runJob(sortJob);


      // merge with existing crawl db
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: Merging injected urls into crawl db.");
      }
      // Merge the temporary data generated from the urls directory (tempDir)
      // into the data already in crawlDb.
      // mergeJob's configuration is set up in CrawlDb.createJob().
      JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
      FileInputFormat.addInputPath(mergeJob, tempDir);
      // InjectReducer's reduce function performs the merge.
      mergeJob.setReducerClass(InjectReducer.class);
      JobClient.runJob(mergeJob);
      // Back up the old crawlDb data and install the new result as the current data.
      CrawlDb.install(mergeJob, crawlDb);

      // clean up: remove the intermediate data
      FileSystem fs = FileSystem.get(getConf());
      fs.delete(tempDir, true);

      long end = System.currentTimeMillis();
      LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
          + TimingUtil.elapsedTime(start, end));
    }
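
The directory swap described in step 4 can be sketched on a local file system. This illustrates only the rename sequence from the step list and is not the body of `CrawlDb.install`: it uses `java.nio.file` instead of Hadoop's `FileSystem`, and `InstallSketch`, `install`, and `deleteRecursively` are names made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class InstallSketch {
    // Delete a directory tree; reverse path order visits children before parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // current -> old (replacing any earlier "old"), then newCrawlDb -> current.
    static void install(Path crawlDb, Path newCrawlDb) throws IOException {
        Path old = crawlDb.resolve("old");
        Path current = crawlDb.resolve("current");
        if (Files.exists(current)) {
            deleteRecursively(old);
            Files.move(current, old);
        }
        Files.move(newCrawlDb, current);
    }

    public static void main(String[] args) throws IOException {
        Path crawlDb = Files.createTempDirectory("crawldb-sketch");
        Path current = Files.createDirectories(crawlDb.resolve("current"));
        Files.write(current.resolve("part-00000"), "old data".getBytes(StandardCharsets.UTF_8));
        Path fresh = Files.createDirectories(crawlDb.resolve("1234567"));
        Files.write(fresh.resolve("part-00000"), "new data".getBytes(StandardCharsets.UTF_8));

        install(crawlDb, fresh);

        byte[] now = Files.readAllBytes(crawlDb.resolve("current").resolve("part-00000"));
        System.out.println("current holds: " + new String(now, StandardCharsets.UTF_8));
    }
}
```

Keeping the previous generation under "old" means one earlier state of the database survives each run, at the cost of storing the data twice.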

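Next, let's look at how InjectMapper implements the map operation. A seed URL can carry two optional parameters, and this function is also where the URL filters and scoring filters are applied.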

    // key is not used here; value is one line read from a file under the urls folder.
    // Results are appended to output: Text is the URL, and CrawlDatum holds the
    // state information that belongs to that URL.
    public void map(WritableComparable key, Text value,
                    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      String url = value.toString();  // value is line of text

      if (url != null && url.trim().startsWith("#")) {
        /* Ignore line that start with # */
        return;
      }

      // A line in the URL file starts with the URL itself; anything after it
      // must be separated by tabs. Two parameters are currently recognized after
      // the URL (both may be present): "nutch.score" and "nutch.fetchInterval".
      // They let us set the score and fetch interval of an individual URL.
      // Any other name=value pair is stored in the CrawlDatum as metadata.
      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      Map<String, String> metadata = new TreeMap<String, String>();
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try {
              customScore = Float.parseFloat(metavalue);
            } catch (NumberFormatException nfe) {}
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try {
              customInterval = Integer.parseInt(metavalue);
            } catch (NumberFormatException nfe) {}
          } else metadata.put(metaname, metavalue);
        }
      }

      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        // Note: this is the entry point for the URL filters. A URL must pass them
        // before it is injected into crawlDb. If it does not, filter returns null
        // and nothing further is done with this URL.
        url = filters.filter(url);  // filter the url
      } catch (Exception e) {
        if (LOG.isWarnEnabled()) { LOG.warn("Skipping " + url + ":" + e); }
        url = null;
      }
      if (url != null) {  // if it passes
        value.set(url);   // collect it
        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, customInterval);
        datum.setFetchTime(curTime);
        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          datum.getMetaData().put(new Text(keymd), new Text(valuemd));
        }
        if (customScore != -1) datum.setScore(customScore);
        else datum.setScore(scoreInjected);
        try {
          // Note: this is the entry point for the scoring filters.
          scfilters.injectedScore(value, datum);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url
                + ", using default (" + e.getMessage() + ")");
          }
        }
        // Add the result to output.
        output.collect(value, datum);
      }
    }
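
As a concrete illustration of the seed-line format, the sketch below parses one line in the same way as the listing, without any Nutch or Hadoop classes. `SeedLineSketch`, the `Seed` holder, the default values, and the example URL are all made up for the example; the two key names are the ones quoted in the comments above. Skipping blank lines is an addition of the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class SeedLineSketch {
    static final class Seed {
        String url;
        float score;
        int fetchInterval;
        final Map<String, String> metadata = new TreeMap<>();
    }

    // Returns null for blank and comment lines (the blank-line check is an addition);
    // otherwise the URL plus any tab-separated name=value pairs.
    static Seed parse(String line, float defaultScore, int defaultInterval) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        Seed seed = new Seed();
        seed.score = defaultScore;
        seed.fetchInterval = defaultInterval;
        String[] splits = line.split("\t");
        seed.url = splits[0];
        for (int s = 1; s < splits.length; s++) {
            int eq = splits[s].indexOf('=');
            if (eq == -1) {
                continue;                               // skip anything without '='
            }
            String name = splits[s].substring(0, eq);
            String value = splits[s].substring(eq + 1);
            try {
                if (name.equals("nutch.score")) {
                    seed.score = Float.parseFloat(value);
                } else if (name.equals("nutch.fetchInterval")) {
                    seed.fetchInterval = Integer.parseInt(value);
                } else {
                    seed.metadata.put(name, value);     // everything else is plain metadata
                }
            } catch (NumberFormatException e) {
                // malformed number: keep the default
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        Seed seed = parse(
            "http://example.com/\tnutch.score=2.5\tnutch.fetchInterval=3600\tcategory=news",
            1.0f, 2592000);
        System.out.println(seed.url + " score=" + seed.score
            + " interval=" + seed.fetchInterval + " metadata=" + seed.metadata);
    }
}
```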

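Now let's look at how InjectReducer implements reduce. In short, new and old entries are merged, and when they conflict the old one is kept.

    // As its name suggests, reduce handles the one-to-many case; here it simply
    // merges new and old entries. The parameters hint at this: one key but a
    // group of values, and the output is still an OutputCollector<Text, CrawlDatum>.
    public void reduce(Text key, Itera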
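
The merge rule stated above (on a conflict, keep the old entry) can be sketched without Hadoop. The types here are stand-ins: `Entry`, its `injected` flag, and `MergeSketch.reduce` are hypothetical names, and the real `CrawlDatum` carries far more state than this. It is a sketch of the rule as the text describes it, not a reconstruction of the truncated listing.

```java
import java.util.Arrays;
import java.util.Iterator;

final class MergeSketch {
    static final class Entry {
        final boolean injected;   // true: freshly injected; false: already in crawlDb
        final String note;

        Entry(boolean injected, String note) {
            this.injected = injected;
            this.note = note;
        }
    }

    // All values belong to one URL key. Prefer an existing entry and fall back
    // to the injected one only when the URL is new.
    static Entry reduce(Iterator<Entry> values) {
        Entry old = null;
        Entry injected = null;
        while (values.hasNext()) {
            Entry e = values.next();
            if (e.injected) {
                injected = e;
            } else {
                old = e;
            }
        }
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        Entry existing = new Entry(false, "existing entry");
        Entry fresh = new Entry(true, "injected entry");

        Entry conflict = reduce(Arrays.asList(fresh, existing).iterator());
        System.out.println("URL in both:  " + conflict.note);

        Entry brandNew = reduce(Arrays.asList(fresh).iterator());
        System.out.println("URL only new: " + brandNew.note);
    }
}
```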
