Lucene Code Analysis

1. Lucene Source Code Analysis [1]

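First, a note on pronunciation: Lucene is pronounced "Loo-seen", as mentioned in Lucene in Action.

One more thing to stress: I am using version 1.9. Don't be surprised that I picked such an old release; I did so in order to understand Lucene's core better.

Anyone who has read the latest version knows that it is not so simple for a beginner. It involves multithreading and design patterns, and for me at least it is quite a challenge.

Starting with an old version was also inspired by Zhao Jiong, author of 《LINUX内核完全注释》 (a complete annotation of the Linux kernel), who analyzed not the latest Linux kernel but version 0.11.

I will begin by explaining things through debugging. I suspect that, like me, you would lose patience after staring at analyzers for half a day, so I will first write an example that does nothing meaningful (examples that do something meaningful are easy to find online):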

package forfun;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class Test
{
    public static void main(String[] args) throws Exception
    {
        // third argument true: start a fresh index in E:\a\
        IndexWriter writer = new IndexWriter("E:\\a\\", new SimpleAnalyzer(), true);
    }
}

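IndexWriter is the most central class. A typical blogger analyzes every other package and then, with only this core package left, runs out of steam.

Let's look at its parameters first. The first is the path where the index is stored. The second is an Analyzer object, which filters and tokenizes the input data. If the third is true, the existing contents of the directory are removed and the index is rebuilt; if it is false, new content is appended to the existing index.

Run the example and you will find a segments file in the directory you specified.

While debugging, ignore the SimpleAnalyzer class for now.

Here is the IndexWriter constructor: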

public IndexWriter(String path, Analyzer a, boolean create)
    throws IOException {
    this(FSDirectory.getDirectory(path, create), a, create, true);
}

Here we meet a new class, FSDirectory:

public static FSDirectory getDirectory(String path, boolean create)
    throws IOException {
    return getDirectory(new File(path), create);
}

Next, the getDirectory overload it delegates to:

public static FSDirectory getDirectory(File file, boolean create)
    throws IOException {
    file = new File(file.getCanonicalPath());
    FSDirectory dir;
    synchronized (DIRECTORIES) {
        dir = (FSDirectory) DIRECTORIES.get(file);
        if (dir == null) {
            try {
                dir = (FSDirectory) IMPL.newInstance();
            } catch (Exception e) {
                throw new RuntimeException(
                    "cannot load FSDirectory class: " + e.toString());
            }
            dir.init(file, create);
            DIRECTORIES.put(file, dir);
        } else if (create) {
            dir.create();
        }
    }
    synchronized (dir) {
        dir.refCount++;
    }
    return dir;
}

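DIRECTORIES is a Hashtable. According to its comment, it is a cache of directories which guarantees that each path maps to exactly one Directory instance, so that synchronizing on the Directory can be used to coordinate access between readers and writers.

(This cache of directories ensures that there is a unique Directory instance per path, so that synchronization on the Directory can be used to synchronize access between readers and writers.)

There is not much more to explain: the function creates the directory if needed and finally increments refCount.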
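To make the caching idea concrete, here is a minimal sketch in plain Java, not Lucene's actual code. The class CachedDir and its get method are hypothetical names invented for illustration; the sketch only shows the two things the paragraph above describes: one shared instance per canonical path, and a reference count that is incremented on every lookup.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FSDirectory: one shared instance per canonical path.
class CachedDir {
    private static final Map<File, CachedDir> CACHE = new HashMap<>();

    final File path;
    int refCount; // number of callers that have obtained this instance

    private CachedDir(File path) {
        this.path = path;
    }

    static CachedDir get(String name) {
        File key;
        try {
            // Canonicalize so that "index" and "./index" share one cache key.
            key = new File(name).getCanonicalFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        CachedDir dir;
        synchronized (CACHE) {
            dir = CACHE.get(key);
            if (dir == null) {
                dir = new CachedDir(key);
                CACHE.put(key, dir);
            }
        }
        synchronized (dir) {
            dir.refCount++;
        }
        return dir;
    }

    public static void main(String[] args) {
        CachedDir a = CachedDir.get("index");
        CachedDir b = CachedDir.get("./index");
        System.out.println("same instance: " + (a == b));
        System.out.println("refCount: " + a.refCount);
    }
}
```

Because both lookups resolve to the same canonical path, they return the same object, which is what makes it safe to use that object as a lock shared by readers and writers.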

Now back to the IndexWriter constructor:

private IndexWriter(Directory d, Analyzer a, final boolean create,
                    boolean closeDir) throws IOException {
    this.closeDir = closeDir;
    directory = d;
    analyzer = a;

    Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
    if (!writeLock.obtain(WRITE_LOCK_TIMEOUT))   // obtain write lock
        throw new IOException("Index locked for write: " + writeLock);
    this.writeLock = writeLock;                  // save it

    synchronized (directory) {                   // in- & inter-process sync
        new Lock.With(directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
                      COMMIT_LOCK_TIMEOUT) {
            public Object doBody() throws IOException {
                if (create)
                    segmentInfos.write(directory);
                else
                    segmentInfos.read(directory);
                return null;
            }
        }.run();
    }
}

What interests me here is the call to segmentInfos.write inside doBody. Let's step into that function:

public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
        output.writeInt(FORMAT);        // write FORMAT
        output.writeLong(++version);    // every write changes the index
        output.writeInt(counter);       // write counter
        output.writeInt(size());        // write infos
        for (int i = 0; i < size(); i++) {
            SegmentInfo si = info(i);
            output.writeString(si.name);
            output.writeInt(si.docCount);
        }
    } finally {
        output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
}

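Start with the first call. It creates a file named segments.new (if you are debugging, you can watch the file appear) and returns an IndexOutput object that is used to write to it.

We won't worry yet about what the values are for. The first, FORMAT, is -1. The second, version, is initialized from System.currentTimeMillis() so that the version number is unique, and it is incremented on every write.

Next, counter is 0.

SegmentInfos extends Vector, so size() is the number of elements it holds. We have not indexed any documents yet, so it is empty.

The last statement renames segments.new to segments.

You can open segments with UltraEdit or WinHex to inspect its contents. Mine looks like this:

FF FF FF FF  00 00 01 22 15 02 07 2A  00 00 00 00  00 00 00 00

writeInt writes four bytes and writeLong writes eight, so the four values are now easy to pick out: FORMAT (FF FF FF FF, that is -1), version, counter (0) and the number of segments (0).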
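As a quick check of that reading, the 20 bytes can be decoded with the JDK alone. The sketch below is not Lucene code: SegmentsHeader, parse and fromHex are hypothetical names made up for this example, and it relies on the assumption that the file stores the int and long values high byte first, which is what the hex dump suggests and what java.io.DataInputStream expects.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that decodes the header of an empty segments file.
class SegmentsHeader {
    final int format;        // FORMAT marker
    final long version;      // index version
    final int counter;       // counter value
    final int segmentCount;  // value of size() at write time

    SegmentsHeader(int format, long version, int counter, int segmentCount) {
        this.format = format;
        this.version = version;
        this.counter = counter;
        this.segmentCount = segmentCount;
    }

    // Reads int, long, int, int in big-endian order.
    static SegmentsHeader parse(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int format = in.readInt();
            long version = in.readLong();
            int counter = in.readInt();
            int segmentCount = in.readInt();
            return new SegmentsHeader(format, version, counter, segmentCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts a hex string (spaces allowed) into bytes.
    static byte[] fromHex(String hex) {
        String digits = hex.replace(" ", "");
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        SegmentsHeader h = parse(fromHex("FFFFFFFF 000001221502072A 00000000 00000000"));
        System.out.println("format=" + h.format + " version=" + h.version
                + " counter=" + h.counter + " segments=" + h.segmentCount);
    }
}
```

Running it on the dump above should show the format marker as -1 and both counter and segment count as 0, with the version being a millisecond timestamp.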

2. Lucene Source Code Analysis [2]

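Last time I mentioned the Analyzer class and said that it filters and tokenizes the input. Now let's look at it in detail. In Lucene an Analyzer is usually composed of a Tokenizer and TokenFilters. First, Tokenizer: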

public abstract class Tokenizer extends TokenStream {
    /** The text source for this Tokenizer. */
    protected Reader input;

    /** Construct a tokenizer with null input. */
    protected Tokenizer() {
    }

    /** Construct a token stream processing the given input. */
    protected Tokenizer(Reader input) {
        this.input = input;
    }

    /** By default, closes the input Reader. */
    public void close() throws IOException {
        input.close();
    }
}

It is just an abstract class with no functions worth our attention, so let's look at its parent class, TokenStream:

public abstract class TokenStream {
    /** Returns the next token in the stream, or null at EOS. */
    public abstract Token next() throws IOException;

    /** Releases resources associated with this stream. */
    public void close() throws IOException {
    }
}

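So the function worth our attention lives in the parent class: next, which returns the next token in the stream.

The other class mentioned above, TokenFilter, also extends TokenStream: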

public abstract class TokenFilter extends TokenStream {
    /** The source of tokens for this filter. */
    protected TokenStream input;

    /** Call TokenFilter(TokenStream) instead.
     * @deprecated */
    protected TokenFilter() {
    }

    /** Construct a token stream filtering the given input. */
    protected TokenFilter(TokenStream input) {
        this.input = input;
    }

    /** Close the input TokenStream. */
    public void close() throws IOException {
        input.close();
    }
}

Again, a test class that does nothing meaningful:

package forfun;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.LetterTokenizer;

public class TokenTest
{
    public static void main(String[] args) throws Exception
    {
        File f = new File("E:\\source.txt");
        BufferedReader reader = new BufferedReader(new FileReader(f));
        LetterTokenizer lt = new LetterTokenizer(reader);
        // print only the first token of the file
        System.out.println(lt.next());
    }
}

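In source.txt I wrote: hello world!

You can of course write something else. I use LetterTokenizer to tokenize the text and then print the first token.

Let's see how it tokenizes, that is, what next actually does.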

public class LetterTokenizer extends CharTokenizer {
    /** Construct a new LetterTokenizer. */
    public LetterTokenizer(Reader in) {
        super(in);
    }

    /** Collects only characters which satisfy
     * {@link Character#isLetter(char)}. */
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}

isTokenChar decides whether c is a letter. LetterTokenizer does not implement next itself, so we go to its parent class, CharTokenizer, and find next there:

/** Returns the next token in the stream, or null at EOS. */
public final Token next() throws IOException {
    int length = 0;
    int start = offset;
    while (true) {
        final char c;

        offset++;
        if (bufferIndex >= dataLen) {
            dataLen = input.read(ioBuffer);
            bufferIndex = 0;
        }
        if (dataLen == -1) {
            if (length > 0)
                break;
            else
                return null;
        } else
            c = ioBuffer[bufferIndex++];

        if (isTokenChar(c)) {                 // if it's a token char
            if (length == 0)                  // start of token
                start = offset - 1;
            buffer[length++] = normalize(c);  // buffer it, normalized
            if (length == MAX_WORD_LEN)       // buffer overflow!
                break;
        } else if (length > 0)                // at non-Letter w/ chars
            break;                            // return 'em
    }

    return new Token(new String(buffer, 0, length), start,
                     start + length);
}

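It looks long but is actually simple, or at least simple to read. isTokenChar is the function we just saw in LetterTokenizer. The code uses start to record where a token begins and length to record how long it is; when it meets a non-token character after having collected at least one character, it breaks out of the loop. We also meet a new class, Token, whose constructor here takes the string, the start offset and the end offset.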
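The same start/length bookkeeping can be tried out without Lucene. The following is a simplified sketch, not the library's implementation: LetterSplitter and Tok are hypothetical names, the input is a String rather than a buffered Reader, all tokens are collected at once, and the MAX_WORD_LEN cut-off and normalize step are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified re-creation of the letter-based tokenizing loop.
class LetterSplitter {
    // Minimal token: text plus start (inclusive) and end (exclusive) offsets.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    static List<Tok> tokenize(String input) {
        List<Tok> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int start = 0;
        for (int offset = 0; offset < input.length(); offset++) {
            char c = input.charAt(offset);
            if (Character.isLetter(c)) {
                if (buffer.length() == 0) {
                    start = offset;          // first character of a new token
                }
                buffer.append(c);
            } else if (buffer.length() > 0) {
                // non-letter after at least one letter: the token is complete
                tokens.add(new Tok(buffer.toString(), start, offset));
                buffer.setLength(0);
            }
        }
        if (buffer.length() > 0) {           // token that runs to end of input
            tokens.add(new Tok(buffer.toString(), start, input.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world!"));
    }
}
```

For the input used in the test class earlier, this yields two tokens, hello at offsets 0 to 5 and world at offsets 6 to 11, with the space and the exclamation mark acting only as separators.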

Here is the relevant part of the Token source:

String termText;                    // the text of the term
int startOffset;                    // start in source text
int endOffset;                      // end in source text
String type = "word";               // lexical type

private int positionIncrement = 1;

/** Constructs a Token with the given term text, and start & end offsets.
    The type defaults to "word." */
public Token(String text, int start, int end) {
    termText = text;
    startOffset = start;
    endOffset = end;
}

/** Constructs a Token with the given text, start and end offsets, & type. */
public Token(String text, int start, int end, String typ) {
    termText = text;
    startOffset = start;
    endOffset = end;
    type = typ;
}

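Matching these against the constructor we just used makes the meaning of the first three member variables clear. For type and positionIncrement I will quote someone else's explanation: type mainly describes the text encoding and language type; single denotes a single ASCII character, double denotes a non-ASCII character, and word is the default type that makes no such distinction.

positionIncrement is the position increment of a token relative to the previous one. It is used for cases such as pinyin, where the pinyin sits at the same position as the word it annotates ("right above" it).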
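A small sketch may help with the position increment. It is plain Java rather than Lucene code, PositionDemo and positions are hypothetical names, and it assumes the usual convention that the default increment of 1 puts a token one position after its predecessor while an increment of 0 stacks it on the same position, as with the pinyin case.

```java
import java.util.Arrays;

// Hypothetical illustration: turn per-token position increments into positions.
class PositionDemo {
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int position = -1; // so that a first increment of 1 gives position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            out[i] = position;
        }
        return out;
    }

    public static void main(String[] args) {
        // Third token has increment 0 (stacked on the second one);
        // fourth token has increment 2 (one position is skipped).
        int[] increments = {1, 1, 0, 2};
        System.out.println(Arrays.toString(positions(increments)));
    }
}
```

With the increments 1, 1, 0, 2 the tokens land on positions 0, 1, 1 and 3: the third token shares a position with the second, and the gap before the fourth is the kind of hole a removed word could leave behind.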

3. Lucene Source Code Analysis [3]

For TokenFilter, let's start with the simplest one, LowerCaseFilter. Its next function is:

public final Token next() throws IOException {
    Token t = input.next();

    if (t == null)
        return null;

    t.termText = t.termText.toLowerCase();

    return t;
}

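Nothing exciting: it just lower-cases the string in the Token object. If you want something more interesting, look at PorterStemFilter; the method is also covered in Introduction to Information Retrieval (Cambridge University Press), on page 34.

Now a slightly more meaningful TokenFilter, StopFilter. Here are its makeStopSet and next functions: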

public static final Set makeStopSet(String[] stopWords) {
    return makeStopSet(stopWords, false);
}

public static final Set makeStopSet(String[] stopWords, boolean ignoreCase) {
    HashSet stopTable = new HashSet(stopWords.length);
    for (int i = 0; i < stopWords.length; i++)
        stopTable.add(ignoreCase ? stopWords[i].toLowerCase() : stopWords[i]);
    return stopTable;
}

public final Token next() throws IOException {
    // return the first non-stop word found
    for (Token token = input.next(); token != null; token = input.next())
    {
        String termText = ignoreCase ? token.termText.toLowerCase() : token.termText;
        if (!stopWords.contains(termText))
            return token;
    }
    // reached EOS -- return null
    return null;
}

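makeStopSet adds every word to be filtered out to stopTable (which, despite its name, is a HashSet). In next, any token whose text is found in that set is skipped, and the first token that is not a stop word is returned.

Now a simple Analyzer. This is the tokenStream function of StopAnalyzer: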

public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
}

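Remember this sentence? "In Lucene an Analyzer is usually composed of a Tokenizer and TokenFilters." Here is the evidence: the text supplied by reader is first tokenized and the resulting tokens are then filtered.

And tokenStream is, of course, the function we call when we want to tokenize something.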
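The tokenizer-plus-filter composition can be mimicked in a few lines of plain Java. This is only a sketch of the pattern, not Lucene's API: MiniAnalyzer, TokenSource, lowerCaseLetters, stopFilter and analyze are hypothetical names, and tokens are reduced to bare strings. As in TokenStream, next returns null at the end of the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical miniature of the Tokenizer -> TokenFilter chain.
class MiniAnalyzer {
    // Stand-in for TokenStream: next() returns null when the stream is exhausted.
    interface TokenSource {
        String next();
    }

    // Stand-in for a lower-casing tokenizer: emits runs of letters, lower-cased.
    static TokenSource lowerCaseLetters(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        Iterator<String> it = Arrays.asList(parts).iterator();
        return () -> {
            while (it.hasNext()) {
                String s = it.next();
                if (!s.isEmpty()) {
                    return s;
                }
            }
            return null;
        };
    }

    // Stand-in for a stop filter: wraps another source and skips stop words.
    static TokenSource stopFilter(TokenSource input, Set<String> stopWords) {
        return () -> {
            for (String t = input.next(); t != null; t = input.next()) {
                if (!stopWords.contains(t)) {
                    return t;
                }
            }
            return null;
        };
    }

    // Plays the role of tokenStream: build the chain, then drain it.
    static List<String> analyze(String text, Set<String> stopWords) {
        TokenSource stream = stopFilter(lowerCaseLetters(text), stopWords);
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(analyze("The quick Fox and THE dog", stops));
    }
}
```

The filter knows nothing about where its tokens come from; it only calls next on whatever source it wraps. That is why filters can be stacked freely on top of any tokenizer.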

4. Lucene Source Code Analysis [4]

Now an example that is slightly meaningful: we add "Hello World" to the index:

package forfun;

import org.apache.lucene.
