有关Lucene的问题4影响Lucene对文档打分的四种方式.docx

资源描述

有关Lucene的问题4影响Lucene对文档打分的四种方式.docx

《有关Lucene的问题4影响Lucene对文档打分的四种方式.docx》由会员分享，可在线阅读，更多相关《有关Lucene的问题4影响Lucene对文档打分的四种方式.docx（21页珍藏版）》请在冰豆网上搜索。

有关Lucene的问题4影响Lucene对文档打分的四种方式.docx

有关Lucene的问题4影响Lucene对文档打分的四种方式

有关Lucene的问题（4）:

影响Lucene对文档打分的四种方式

在索引阶段设置DocumentBoost和FieldBoost，存储在（.nrm）文件中。

如果希望某些文档和某些域比其他的域更重要，如果此文档和此域包含所要查询的词则应该得分较高，则可以在索引阶段设定文档的boost和域的boost值。

这些值是在索引阶段就写入索引文件的，存储在标准化因子（.nrm）文件中，一旦设定，除非删除此文档，否则无法改变。

如果不进行设定，则DocumentBoost和FieldBoost默认为1。

DocumentBoost及FieldBoost的设定方式如下：

Documentdoc=newDocument（）;

Fieldf=newField（"contents","helloworld",Field.Store.NO,Field.Index.ANALYZED）;

f.setBoost（100）;

doc.add（f）;

doc.setBoost（100）;

两者是如何影响Lucene的文档打分的呢？

让我们首先来看一下Lucene的文档打分的公式：

score（q,d） = coord（q,d） · queryNorm（q） · ∑（tf（tind） · idf（t）2 · t.getBoost（）· norm（t,d））

tinq

DocumentBoost和FieldBoost影响的是norm（t,d），其公式如下：

norm（t,d） = doc.getBoost（） · lengthNorm（field） · ∏f.getBoost（）

fieldfindnamedast

它包括三个参数：

∙Documentboost：

此值越大，说明此文档越重要。

∙Fieldboost：

此域越大，说明此域越重要。

∙lengthNorm（field）=（1.0/Math.sqrt（numTerms））：

一个域中包含的Term总数越多，也即文档越长，此值越小，文档越短，此值越大。

其中第三个参数可以在自己的Similarity中影响打分，下面会论述。

当然，也可以在添加Field的时候，设置Field.Index.ANALYZED_NO_NORMS或Field.Index.NOT_ANALYZED_NO_NORMS，完全不用norm，来节约空间。

根据Lucene的注释，Nonormsmeansthatindex-timefieldanddocumentboostingandfieldlengthnormalizationaredisabled. ThebenefitislessmemoryusageasnormstakeuponebyteofRAMperindexedfieldforeverydocumentintheindex,duringsearching. Notethatonceyouindexagivenfieldwithnormsenabled,disablingnormswillhavenoeffect.没有norms意味着索引阶段禁用了文档boost和域的boost及长度标准化。

好处在于节省内存，不用在搜索阶段为索引中的每篇文档的每个域都占用一个字节来保存norms信息了。

但是对norms信息的禁用是必须全部域都禁用的，一旦有一个域不禁用，则其他禁用的域也会存放默认的norms值。

因为为了加快norms的搜索速度，Lucene是根据文档号乘以每篇文档的norms信息所占用的大小来计算偏移量的，中间少一篇文档，偏移量将无法计算。

也即norms信息要么都保存，要么都不保存。

下面几个试验可以验证norms信息的作用：

试验一：

DocumentBoost的作用

publicvoidtestNormsDocBoost（）throwsException{

FileindexDir=newFile（"testNormsDocBoost"）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

writer.setUseCompoundFile（false）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"contents","commonhellohello",Field.Store.NO,Field.Index.ANALYZED）;

doc1.add（f1）;

doc1.setBoost（100）;

writer.addDocument（doc1）;

Documentdoc2=newDocument（）;

Fieldf2=newField（"contents","commoncommonhello",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc2.add（f2）;

writer.addDocument（doc2）;

Documentdoc3=newDocument（）;

Fieldf3=newField（"contents","commoncommoncommon",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc3.add（f3）;

writer.addDocument（doc3）;

writer.close（）;

IndexReaderreader=IndexReader.open（FSDirectory.open（indexDir））;

IndexSearchersearcher=newIndexSearcher（reader）;

TopDocsdocs=searcher.search（newTermQuery（newTerm（"contents","common"））,10）;

for（ScoreDocdoc:

docs.scoreDocs）{

System.out.println（"docid:

"+doc.doc+"score:

"+doc.score）;

}

如果第一篇文档的域f1也为Field.Index.ANALYZED_NO_NORMS的时候，搜索排名如下：

docid:

2score:

1.2337708

docid:

1score:

1.0073696

docid:

0score:

0.71231794

如果第一篇文档的域f1设为Field.Index.ANALYZED，则搜索排名如下：

docid:

0score:

39.889805

docid:

2score:

0.6168854

docid:

1score:

0.5036848

试验二：

FieldBoost的作用

如果我们觉得title要比contents要重要，可以做一下设定。

publicvoidtestNormsFieldBoost（）throwsException{

FileindexDir=newFile（"testNormsFieldBoost"）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

writer.setUseCompoundFile（false）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"title","commonhellohello",Field.Store.NO,Field.Index.ANALYZED）;

f1.setBoost（100）;

doc1.add（f1）;

writer.addDocument（doc1）;

Documentdoc2=newDocument（）;

Fieldf2=newField（"contents","commoncommonhello",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc2.add（f2）;

writer.addDocument（doc2）;

writer.close（）;

IndexReaderreader=IndexReader.open（FSDirectory.open（indexDir））;

IndexSearchersearcher=newIndexSearcher（reader）;

QueryParserparser=newQueryParser（Version.LUCENE_CURRENT,"contents",newStandardAnalyzer（Version.LUCENE_CURRENT））;

Queryquery=parser.parse（"title:

commoncontents:

common"）;

TopDocsdocs=searcher.search（query,10）;

for（ScoreDocdoc:

docs.scoreDocs）{

System.out.println（"docid:

"+doc.doc+"score:

"+doc.score）;

}

如果第一篇文档的域f1也为Field.Index.ANALYZED_NO_NORMS的时候，搜索排名如下：

docid:

1score:

0.49999997

docid:

0score:

0.35355338

如果第一篇文档的域f1设为Field.Index.ANALYZED，则搜索排名如下：

docid:

0score:

19.79899

docid:

1score:

0.49999997

试验三：

norms中文档长度对打分的影响

publicvoidtestNormsLength（）throwsException{

FileindexDir=newFile（"testNormsLength"）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

writer.setUseCompoundFile（false）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"contents","commonhellohello",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc1.add（f1）;

writer.addDocument（doc1）;

Documentdoc2=newDocument（）;

Fieldf2=newField（"contents","commoncommonhellohellohellohello",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc2.add（f2）;

writer.addDocument（doc2）;

writer.close（）;

IndexReaderreader=IndexReader.open（FSDirectory.open（indexDir））;

IndexSearchersearcher=newIndexSearcher（reader）;

QueryParserparser=newQueryParser（Version.LUCENE_CURRENT,"contents",newStandardAnalyzer（Version.LUCENE_CURRENT））;

Queryquery=parser.parse（"title:

commoncontents:

common"）;

TopDocsdocs=searcher.search（query,10）;

for（ScoreDocdoc:

docs.scoreDocs）{

System.out.println（"docid:

"+doc.doc+"score:

"+doc.score）;

}

当norms被禁用的时候，包含两个common的第二篇文档打分较高：

docid:

1score:

0.13928263

docid:

0score:

0.09848769

当norms起作用的时候，虽然包含两个common的第二篇文档，由于长度较长，因而打分较低：

docid:

0score:

0.09848769

docid:

1score:

0.052230984

试验四：

norms信息要么都保存，要么都不保存的特性

publicvoidtestOmitNorms（）throwsException{

FileindexDir=newFile（"testOmitNorms"）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

writer.setUseCompoundFile（false）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"title","commonhellohello",Field.Store.NO,Field.Index.ANALYZED）;

doc1.add（f1）;

writer.addDocument（doc1）;

for（inti=0;i<10000;i++）{

Documentdoc2=newDocument（）;

Fieldf2=newField（"contents","commoncommonhellohellohellohello",Field.Store.NO,Field.Index.ANALYZED_NO_NORMS）;

doc2.add（f2）;

writer.addDocument（doc2）;

}

writer.close（）;

}

当我们添加10001篇文档，所有的文档都设为Field.Index.ANALYZED_NO_NORMS的时候，我们看索引文件，发现.nrm文件只有1K，也即其中除了保持一定的格式信息，并无其他数据。

当我们把第一篇文档设为Field.Index.ANALYZED，而其他10000篇文档都设为Field.Index.ANALYZED_NO_NORMS的时候，发现.nrm文件又10K，也即所有的文档都存储了norms信息，而非只有第一篇文档。

在搜索语句中，设置QueryBoost.

在搜索中，我们可以指定，某些词对我们来说更重要，我们可以设置这个词的boost：

common^4hello

使得包含common的文档比包含hello的文档获得更高的分数。

由于在Lucene中，一个Term定义为Field:

Term，则也可以影响不同域的打分：

title:

common^4content:

common

使得title中包含common的文档比content中包含common的文档获得更高的分数。

实例：

publicvoidtestQueryBoost（）throwsException{

FileindexDir=newFile（"TestQueryBoost"）;

IndexWriterwriter=newIndexWriter（FSDirectory.open（indexDir）,newStandardAnalyzer（Version.LUCENE_CURRENT）,true,IndexWriter.MaxFieldLength.LIMITED）;

Documentdoc1=newDocument（）;

Fieldf1=newField（"contents","common1hellohello",Field.Store.NO,Field.Index.ANALYZED）;

doc1.add（f1）;

writer.addDocument（doc1）;

Documentdoc2=newDocument（）;

Fieldf2=newField（"contents","common2common2hello",Field.Store.NO,Field.Index.ANALYZED）;

doc2.add（f2）;

writer.addDocument（doc2）;

writer.close（）;

IndexReaderreader=IndexReader.open（FSDirectory.open（indexDir））;

IndexSearchersearcher=newIndexSearcher（reader）;

QueryParserparser=newQueryParser（Version.LUCENE_CURRENT,"contents",newStandardAnalyzer（Version.LUCENE_CURRENT））;

Queryquery=parser.parse（"common1common2"）;

TopDocsdocs=searcher.search（query,10）;

for（ScoreDocdoc:

docs.scoreDocs）{

System.out.println（"docid:

"+doc.doc+"score:

"+doc.score）;

}

根据tf/idf，包含两个common2的第二篇文档打分较高：

docid:

1score:

0.24999999

docid:

0score:

0.17677669

如果我们输入的查询语句为：

"common1^100common2"，则第一篇文档打分较高：

docid:

0score:

0.2499875

docid:

1score:

0.0035353568

那QueryBoost是如何影响文档打分的呢？

根据Lucene的打分计算公式：

score（q,d） = coord（q,d） · queryNorm（q） ·∑（tf（tind） · idf（t）2 · t.getBoost（）· norm（t,d））

tinq

注：

在queryNorm的部分，也有q.getBoost（）的部分，但是对query向量的归一化（见向量空间模型与Lucene的打分机制[

继承并实现自己的Similarity

Similariy是计算Lucene打分的最主要的类，实现其中的很多借口可以干预打分的过程。

（1）floatcomputeNorm（Stringfield,FieldInvertStatestate）

（2）floatlengthNorm（StringfieldName,intnumTokens）

（3）floatqueryNorm（floatsumOfSquaredWeights）

（4）floattf（floatfreq）

（5）floatidf（intdocFreq,intnumDocs）

（6）floatcoord（intoverlap,intmaxOverlap）

（7）floatscorePayload（intdocId,StringfieldName,intstart,intend,byte[]payload,intoffset,intlength）

它们分别影响Lucene打分计算的如下部分：

score（q,d） = （6）coord（q,d） · （3）queryNorm（q） ·∑（（4）tf（tind） · （5）idf（t）2 · t.getBoost（）·

（1）norm（t,d））

tinq

norm（t,d） = doc.getBoost（） ·

（2）lengthNorm（field） · ∏f.getBoost（）

fieldfind

展开阅读全文