Lucene Questions (4): Four Ways to Influence How Lucene Scores Documents


Setting Document Boost and Field Boost at index time, stored in the (.nrm) file

If some documents or fields matter more than others, so that a document or field containing the query term should score higher, you can set a boost on the document and on the field at index time. These values are written into the index during indexing and stored in the normalization factor (.nrm) file. Once set, they cannot be changed unless the document is deleted. If you set nothing, Document Boost and Field Boost both default to 1.

Document Boost and Field Boost are set as follows:

    Document doc = new Document();
    Field f = new Field("contents", "hello world", Field.Store.NO, Field.Index.ANALYZED);
    f.setBoost(100);
    doc.add(f);
    doc.setBoost(100);

How do the two affect Lucene's document scoring? First, look at Lucene's scoring formula:

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \bigr)$$

Document Boost and Field Boost affect norm(t, d), which is defined as:

$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} \mathrm{f.getBoost}()$$

It has three parameters:

- Document boost: the larger this value, the more important the document.
- Field boost: the larger this value, the more important the field.
- lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more Terms a field contains, that is, the longer the document, the smaller this value; the shorter the document, the larger this value.

The third parameter can be changed through your own Similarity, as discussed below.

When adding a Field you can also set Field.Index.ANALYZED_NO_NORMS or Field.Index.NOT_ANALYZED_NO_NORMS to skip norms entirely and save space. According to the Lucene comments: "No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is less memory usage as norms take up one byte of RAM per indexed field for every document in the index, during searching. Note that once you index a given field with norms enabled, disabling norms will have no effect."

In other words, without norms the index-time document boost, field boost, and length normalization are all disabled. The benefit is lower memory use, because searching no longer needs one byte per field for every document in the index to hold the norms.

However, norms have to be disabled everywhere: as soon as one field does not disable them, the fields that did disable them also store default norms values. The reason is that, to speed up norms lookups, Lucene computes the offset as the document number multiplied by the size of the norms information per document. If one document were missing in between, the offset could not be computed. So norms are either stored for everything or not stored at all.

The following experiments verify the effect of norms.

Experiment 1: the effect of Document Boost

    // Imports for the Lucene 3.0-era API used in all experiments below.
    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public void testNormsDocBoost() throws Exception {
        File indexDir = new File("testNormsDocBoost");
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false);

        // doc1 keeps norms and carries a document boost of 100
        Document doc1 = new Document();
        Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
        doc1.add(f1);
        doc1.setBoost(100);
        writer.addDocument(doc1);

        Document doc2 = new Document();
        Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc2.add(f2);
        writer.addDocument(doc2);

        Document doc3 = new Document();
        Field f3 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc3.add(f3);
        writer.addDocument(doc3);
        writer.close();

        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("contents", "common")), 10);
        for (ScoreDoc doc : docs.scoreDocs) {
            System.out.println("docid : " + doc.doc + " score : " + doc.score);
        }
    }

If field f1 of the first document is also Field.Index.ANALYZED_NO_NORMS, the ranking is:

    docid : 2 score : 1.2337708
    docid : 1 score : 1.0073696
    docid : 0 score : 0.71231794

If field f1 of the first document is set to Field.Index.ANALYZED, the ranking is:

    docid : 0 score : 39.889805
    docid : 2 score : 0.6168854
    docid : 1 score : 0.5036848

Experiment 2: the effect of Field Boost

If we consider title more important than contents, we can set it up as follows.

    public void testNormsFieldBoost() throws Exception {
        File indexDir = new File("testNormsFieldBoost");
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false);

        // The title field keeps norms and carries a field boost of 100
        Document doc1 = new Document();
        Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
        f1.setBoost(100);
        doc1.add(f1);
        writer.addDocument(doc1);

        Document doc2 = new Document();
        Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc2.add(f2);
        writer.addDocument(doc2);
        writer.close();

        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse("title:common contents:common");
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs) {
            System.out.println("docid : " + doc.doc + " score : " + doc.score);
        }
    }

If field f1 of the first document is also Field.Index.ANALYZED_NO_NORMS, the ranking is:

    docid : 1 score : 0.49999997
    docid : 0 score : 0.35355338

If field f1 of the first document is set to Field.Index.ANALYZED, the ranking is:

    docid : 0 score : 19.79899
    docid : 1 score : 0.49999997

Experiment 3: the effect of document length in norms on scoring

    public void testNormsLength() throws Exception {
        File indexDir = new File("testNormsLength");
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false);

        // Short document: 3 terms, "common" once
        Document doc1 = new Document();
        Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc1.add(f1);
        writer.addDocument(doc1);

        // Long document: 6 terms, "common" twice
        Document doc2 = new Document();
        Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc2.add(f2);
        writer.addDocument(doc2);
        writer.close();

        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse("title:common contents:common");
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs) {
            System.out.println("docid : " + doc.doc + " score : " + doc.score);
        }
    }

When norms are disabled, the second document, which contains common twice, scores higher:

    docid : 1 score : 0.13928263
    docid : 0 score : 0.09848769

When norms are in effect (the fields indexed with Field.Index.ANALYZED), the second document still contains common twice, but because it is longer it scores lower:

    docid : 0 score : 0.09848769
    docid : 1 score : 0.052230984

Experiment 4: norms are either stored for everything or not stored at all

    public void testOmitNorms() throws Exception {
        File indexDir = new File("testOmitNorms");
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false);

        Document doc1 = new Document();
        Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
        doc1.add(f1);
        writer.addDocument(doc1);

        for (int i = 0; i < 10000; i++) {
            Document doc2 = new Document();
            Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
            doc2.add(f2);
            writer.addDocument(doc2);
        }
        writer.close();
    }

When we add 10001 documents and all of them use Field.Index.ANALYZED_NO_NORMS, the .nrm file in the index is only 1K. Apart from some format information, it holds no data. When the first document uses Field.Index.ANALYZED and the other 10000 documents use Field.Index.ANALYZED_NO_NORMS, the .nrm file is 10K. Norms information has been stored for every document, not just the first one.

Setting Query Boost in the search query

At search time we can say that certain terms matter more to us by setting a boost on the term:

    common^4 hello

This makes documents containing common score higher than documents containing hello.

Because a Term in Lucene is defined as Field:Term, the boost can also weight different fields:

    title:common^4 content:common

This makes documents with common in title score higher than documents with common in content.

Example:

    public void testQueryBoost() throws Exception {
        File indexDir = new File("TestQueryBoost");
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc1 = new Document();
        Field f1 = new Field("contents", "common1 hello hello", Field.Store.NO, Field.Index.ANALYZED);
        doc1.add(f1);
        writer.addDocument(doc1);

        Document doc2 = new Document();
        Field f2 = new Field("contents", "common2 common2 hello", Field.Store.NO, Field.Index.ANALYZED);
        doc2.add(f2);
        writer.addDocument(doc2);
        writer.close();

        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse("common1 common2");
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs) {
            System.out.println("docid : " + doc.doc + " score : " + doc.score);
        }
    }

According to tf/idf, the second document, which contains common2 twice, scores higher:

    docid : 1 score : 0.24999999
    docid : 0 score : 0.17677669

If the query is common1^100 common2 instead, the first document scores higher:

    docid : 0 score : 0.2499875
    docid : 1 score : 0.0035353568

So how does Query Boost affect document scoring? According to Lucene's scoring formula, it enters through the t.getBoost() factor:

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \bigr)$$

Note: queryNorm also contains a q.getBoost() part, but that part is the normalization of the query vector (see "The Vector Space Model and Lucene's Scoring Mechanism").

Extending Similarity with your own implementation

Similarity is the main class that computes Lucene's scores. Overriding its methods lets you intervene in the scoring process:

(1) float computeNorm(String field, FieldInvertState state)
(2) float lengthNorm(String fieldName, int numTokens)
(3) float queryNorm(float sumOfSquaredWeights)
(4) float tf(float freq)
(5) float idf(int docFreq, int numDocs)
(6) float coord(int overlap, int maxOverlap)
(7) float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)

They affect the following parts of Lucene's score calculation:

$$\mathrm{score}(q,d) = \underbrace{\mathrm{coord}(q,d)}_{(6)} \cdot \underbrace{\mathrm{queryNorm}(q)}_{(3)} \cdot \sum_{t \in q} \bigl( \underbrace{\mathrm{tf}(t \in d)}_{(4)} \cdot \underbrace{\mathrm{idf}(t)^2}_{(5)} \cdot t.\mathrm{getBoost}() \cdot \underbrace{\mathrm{norm}(t,d)}_{(1)} \bigr)$$

$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \underbrace{\mathrm{lengthNorm}(\mathrm{field})}_{(2)} \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} \mathrm{f.getBoost}()$$
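To make the roles of these methods concrete, here is a minimal plain-Java sketch of the scoring formula with no Lucene dependency. The class name ScoreSketch and the helper termScore are hypothetical. The bodies of tf, idf, coord and queryNorm are assumptions modeled on the default behavior of Lucene 3.0's DefaultSimilarity; only the lengthNorm formula is stated in the text above. The main method replays Experiment 2 with norms disabled, which means norm(t,d) is taken as 1.

```java
// Plain-Java sketch of the scoring factors; it does not call Lucene.
// tf, idf, coord and queryNorm are assumed defaults (DefaultSimilarity style);
// lengthNorm follows the formula given in the article.
final class ScoreSketch {
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // norm(t,d) = document boost * lengthNorm * field boost.
    // This is the unrounded value; a real index stores it in one byte.
    static double norm(double docBoost, double fieldBoost, int numTerms) {
        return docBoost * lengthNorm(numTerms) * fieldBoost;
    }

    // Contribution of one matching term: tf * idf^2 * term boost * norm.
    static double termScore(double freq, double idfValue, double termBoost, double norm) {
        return tf(freq) * idfValue * idfValue * termBoost * norm;
    }

    public static void main(String[] args) {
        // Experiment 2, norms disabled: 2 documents, query
        // "title:common contents:common", each term occurs in 1 document.
        double idfValue = idf(1, 2);                     // identical for both terms
        double qn = queryNorm(2 * idfValue * idfValue);  // two terms, query boost 1
        double c = coord(1, 2);                          // each document matches 1 of 2 terms

        double doc0 = c * qn * termScore(1, idfValue, 1.0, 1.0); // "common" once in title
        double doc1 = c * qn * termScore(2, idfValue, 1.0, 1.0); // "common" twice in contents
        System.out.println("docid : 1 score : " + (float) doc1);
        System.out.println("docid : 0 score : " + (float) doc0);
    }
}
```

Under these assumptions the two scores come out at roughly 0.5 and 0.354, in line with the 0.49999997 and 0.35355338 reported for Experiment 2. For the boosted case the sketch would not reproduce 19.79899 exactly: norm(100, 1, 3) is about 57.7 here, whereas the index keeps the norm in a single byte and therefore loses precision.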
