Lucene Code Analysis

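1. Lucene Source Code Analysis 1

First, a word on pronunciation: Lucene is pronounced "Loo-seen", as mentioned in Lucene in Action. One more thing I want to stress is that the version I am using is 1.9. Don't be surprised that I picked such an early version; I did it to get a better understanding of Lucene's core. Anyone who has looked at the latest version will know that, for a beginner, it is not that simple: it involves multithreading and design patterns, which for me at least is quite a challenge. Starting with an old version was also inspired by Zhao Jiong, author of the fully annotated Linux kernel book (LINUX内核完全注释), who analyzed not the latest Linux kernel but version 0.11.

I will begin by explaining things through debugging. I suspect you are like me: after staring at the analyzer for a long time you would start to lose patience.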

So I will first write an example that means nothing at all (examples that mean a little something are all over the web):

    package forfun;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class Test {
        public static void main( String[] args ) throws Exception {
            IndexWriter writer = new IndexWriter( "E:\\a", new SimpleAnalyzer(), true);
        }
    }

IndexWriter is the most central class. The typical blogger analyzes every other package and then, with only this core package left, runs out of steam. Let's look at its parameters first. The first is the path where the index is stored. The second is an Analyzer object, which filters and tokenizes the input data. If the third is true, everything in the existing directory is deleted and the index is rebuilt; if it is false, new content is appended to the existing index. Run the example first and you will find a segments file in the directory you specified. While debugging, ignore the SimpleAnalyzer class for now. Here is the IndexWriter constructor:

    public IndexWriter(String path, Analyzer a, boolean create) throws IOException {
        this(FSDirectory.getDirectory(path, create), a, create, true);
    }

This brings in a new class, FSDirectory:

    public static FSDirectory getDirectory(String path, boolean create) throws IOException {
        return getDirectory(new File(path), create);
    }

Now the getDirectory function it delegates to:

    public static FSDirectory getDirectory(File file, boolean create) throws IOException {
        file = new File(file.getCanonicalPath());
        FSDirectory dir;
        synchronized (DIRECTORIES) {
            dir = (FSDirectory) DIRECTORIES.get(file);
            if (dir == null) {
                try {
                    dir = (FSDirectory) IMPL.newInstance();
                } catch (Exception e) {
                    throw new RuntimeException("cannot load FSDirectory class: " + e.toString());
                }
                dir.init(file, create);
                DIRECTORIES.put(file, dir);
            } else if (create) {
                dir.create();
            }
        }
        synchronized (dir) {
            dir.refCount++;
        }
        return dir;
    }

DIRECTORIES is a Hashtable object. According to its comment, it is a cache of directories that guarantees exactly one Directory per path, so synchronizing on the Directory is enough to synchronize access between readers and writers. (This cache of directories ensures that there is a unique Directory instance per path, so that synchronization on the Directory can be used to synchronize access between readers and writers.) I won't explain much more: the function creates the directory and, at the end, increments refCount.

Back to the IndexWriter constructor:

    private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir) throws IOException {
        this.closeDir = closeDir;
        directory = d;
        analyzer = a;

        Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
        if (!writeLock.obtain(WRITE_LOCK_TIMEOUT)) // obtain write lock
            throw new IOException("Index locked for write: " + writeLock);
        this.writeLock = writeLock; // save it

        synchronized (directory) { // in- & inter-process sync
            new Lock.With(directory.makeLock(IndexWriter.COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
                public Object doBody() throws IOException {
                    if (create)
                        segmentInfos.write(directory);
                    else
                        segmentInfos.read(directory);
                    return null;
                }
            }.run();
        }
    }

What interests me here is the segmentInfos.write call inside doBody.
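The per-path cache that getDirectory maintains can be tried out in isolation. The following is a minimal sketch of the idea, not Lucene's code: CachedDir, get, release and refCount are hypothetical names, a plain HashMap guarded by a single lock stands in for the Hashtable, and release() is an assumption about how the reference count would be used, since the text above only shows the increment.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for FSDirectory: one shared instance per canonical path. */
final class CachedDir {
    // Keyed by canonical file, so "index" and "./index" resolve to the same entry.
    private static final Map<File, CachedDir> CACHE = new HashMap<>();

    final File path;
    private int refCount; // number of callers currently holding this instance

    private CachedDir(File path) {
        this.path = path;
    }

    /** Returns the shared instance for this path, creating it on first use. */
    static CachedDir get(String path) {
        File key;
        try {
            key = new File(path).getCanonicalFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        synchronized (CACHE) {
            CachedDir dir = CACHE.get(key);
            if (dir == null) {
                dir = new CachedDir(key);
                CACHE.put(key, dir);
            }
            dir.refCount++; // every successful lookup counts as one more holder
            return dir;
        }
    }

    /** Assumed counterpart to get(): the last release evicts the cached instance. */
    void release() {
        synchronized (CACHE) {
            if (--refCount <= 0) {
                CACHE.remove(path);
            }
        }
    }

    int refCount() {
        synchronized (CACHE) {
            return refCount;
        }
    }
}

class DirectoryCacheDemo {
    public static void main(String[] args) {
        CachedDir first = CachedDir.get("index");
        CachedDir second = CachedDir.get("./index");
        System.out.println("same instance: " + (first == second));
        System.out.println("refCount: " + first.refCount());
    }
}
```

Because both lookups canonicalize to the same path, they share one object, which is what makes synchronizing on that object meaningful for every reader and writer of the same index directory.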

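Let's step into SegmentInfos.write:

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT); // write FORMAT
            output.writeLong(++version); // every write changes the index
            output.writeInt(counter); // write counter
            output.writeInt(size()); // write infos
            for (int i = 0; i

[The rest of this function and the text that followed it were lost in extraction. What survives resumes inside the read loop of a tokenizer's next function.]

            ... >= dataLen) {
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            };
            if (dataLen == -1) {
                if (length > 0)
                    break;
                else
                    return null;
            } else
                c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) { // if it's a token char
                if (length == 0) // start of token
                    start = offset - 1;
                buffer[length++] = normalize(c); // buffer it, normalized
                if (length == MAX_WORD_LEN) // buffer overflow!
                    break;
            } else if (length > 0) // at non-Letter w/ chars
                break; // return 'em
        }
        return new Token(new String(buffer, 0, length), start, start + length);
    }

It looks long, but it is actually simple, or at least simple to read. isTokenChar is the function we just saw in LetterTokenizer. The code uses start to record where a token begins and length to record how long it is; when it reaches a character that is not a token character and a token is already in progress, it breaks out of the loop. We also meet a new class, Token, whose constructor here takes a string, a start position and an end position. Here is the Token source:

    String termText; // the text of the term
    int startOffset; // start in source text
    int endOffset; // end in source text
    String type = "word"; // lexical type
    private int positionIncrement = 1;

    /** Constructs a Token with the given term text, and start & end offsets. The type defaults to "word." */
    public Token(String text, int start, int end) {
        termText = text;
        startOffset = start;
        endOffset = end;
    }

    /** Constructs a Token with the given text, start and end offsets, & type. */
    public Token(String text, int start, int end, String typ) {
        termText = text;
        startOffset = start;
        endOffset = end;
        type = typ;
    }

Match these against the constructor we just used and the meaning of the first three member variables is clear. For type and positionIncrement I will quote someone else's explanation: type mainly indicates the text encoding and language type, where single means a single ASCII character, double means a non-ASCII character, and word is the default, undifferentiated type. positionIncrement is the position increment, used for cases such as pinyin annotation (where the pinyin sits directly above the word it belongs to).

3. Lucene Source Code Analysis 3

On to TokenFilter, starting with a single example.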

The simplest one is LowerCaseFilter, whose next function is as follows:

    public final Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;
        t.termText = t.termText.toLowerCase();
        return t;
    }

Nothing exciting: it just replaces the string in the Token object with its lowercase form. If you want something more interesting, look at PorterStemFilter; the method is also covered in Introduction to Information Retrieval from Cambridge University Press, on page 34.

Next, a slightly more meaningful TokenFilter, StopFilter:

    public static final Set makeStopSet(String[] stopWords) {
        return makeStopSet(stopWords, false);
    }

    public static final Set makeStopSet(String[] stopWords, boolean ignoreCase) {
        HashSet stopTable = new HashSet(stopWords.length);
        for (int i = 0; i < stopWords.length; i++)
            stopTable.add(ignoreCase ? stopWords[i].toLowerCase() : stopWords[i]);
        return stopTable;
    }

    public final Token next() throws IOException {
        // return the first non-stop word found
        for (Token token = input.next(); token != null; token = input.next()) {
            String termText = ignoreCase ? token.termText.toLowerCase() : token.termText;
            if (!stopWords.contains(termText))
                return token;
        }
        // reached EOS -- return null
        return null;
    }

makeStopSet adds every word to be filtered out to stopTable (I wondered why it does not use a HashSet, although the version shown here does build one), and next skips any token whose text is in that set.

Now a simple Analyzer. This is the tokenStream function of StopAnalyzer:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
    }

Remember this sentence? "In Lucene, an Analyzer is usually made up of a Tokenizer and TokenFilters." Here is the evidence: the text coming in through reader is first tokenized and then filtered. And tokenStream is, of course, the function we call when we want text tokenized.

4. Lucene Source Code Analysis 4

Let's write a slightly more meaningful example and add "Hello World" to the index:

    package forfun;
    import org.apache.lucene.

[The source text is truncated here.]
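The Tokenizer-plus-TokenFilter composition described in section 3 can be reproduced without Lucene on the classpath. What follows is a rough sketch under stated assumptions, not Lucene's implementation: Tok, TokSource, LetterSource, StopSource and MiniStopAnalyzer are hypothetical names that mirror Token, TokenStream, LowerCaseTokenizer, StopFilter and StopAnalyzer, and the tokenizer reads one character at a time instead of using an I/O buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical simplified token: text plus start/end offsets in the source. */
final class Tok {
    final String text;
    final int start;
    final int end;

    Tok(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

/** Hands out tokens one at a time; null signals the end of the stream. */
interface TokSource {
    Tok next() throws IOException;
}

/** Splits on non-letters and lowercases each letter as it is buffered. */
final class LetterSource implements TokSource {
    private final Reader in;
    private int offset = 0; // characters consumed so far

    LetterSource(Reader in) {
        this.in = in;
    }

    public Tok next() throws IOException {
        StringBuilder buf = new StringBuilder();
        int start = 0;
        int c;
        while ((c = in.read()) != -1) {
            offset++;
            if (Character.isLetter((char) c)) {
                if (buf.length() == 0) {
                    start = offset - 1; // first character of a new token
                }
                buf.append(Character.toLowerCase((char) c));
            } else if (buf.length() > 0) {
                break; // a non-letter ends the token in progress
            }
        }
        return buf.length() == 0 ? null : new Tok(buf.toString(), start, start + buf.length());
    }
}

/** Wraps another source and skips every token whose text is a stop word. */
final class StopSource implements TokSource {
    private final TokSource input;
    private final Set<String> stopWords;

    StopSource(TokSource input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    public Tok next() throws IOException {
        for (Tok t = input.next(); t != null; t = input.next()) {
            if (!stopWords.contains(t.text)) {
                return t; // first non-stop word found
            }
        }
        return null; // underlying stream is exhausted
    }
}

/** The analyzer only wires the pieces together: tokenizer first, filter second. */
final class MiniStopAnalyzer {
    private final Set<String> stopWords;

    MiniStopAnalyzer(String... stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    TokSource tokenStream(Reader reader) {
        return new StopSource(new LetterSource(reader), stopWords);
    }

    /** Demo helper: drains the stream into strings of the form text[start,end). */
    List<String> analyze(String s) {
        List<String> out = new ArrayList<>();
        try {
            TokSource ts = tokenStream(new StringReader(s));
            for (Tok t = ts.next(); t != null; t = ts.next()) {
                out.add(t.text + "[" + t.start + "," + t.end + ")");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling new MiniStopAnalyzer("the", "a").analyze("The Quick, a fox") yields quick[4,9) and fox[13,16): the tokenizer has already lowercased "The" by the time the stop filter sees it, which is why the filter itself needs no case handling in this arrangement.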
