Lemur Index File Format Analysis
I. Repository
A. File list and descriptions

FileName                 Description
index (Directory)        Contains zero or more DiskIndex instances, in numbered subdirectories.
collection (Directory)   Contains a CompressedCollection instance.
deleted                  Bitmap containing a list of all deleted documents.
manifest                 XML file storing configuration information about the collection, including index counts, stemmer and stopword information.
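The layout above can be checked with a few lines of Python. This is only a sketch: the function name `describe_repository` is hypothetical, and it assumes the entry names are exactly `index`, `collection`, `deleted` and `manifest`, with DiskIndex instances in purely numeric subdirectories of `index`.

```python
from pathlib import Path


def describe_repository(root):
    """Summarise the top-level layout of a repository directory.

    ASSUMPTION: entry names match the file list (index, collection,
    deleted, manifest) and DiskIndex subdirectories have numeric names.
    """
    root = Path(root)
    index_dir = root / "index"
    disk_indexes = []
    if index_dir.is_dir():
        # DiskIndex instances live in numbered subdirectories of index/.
        names = [p.name for p in index_dir.iterdir()
                 if p.is_dir() and p.name.isdigit()]
        disk_indexes = sorted(names, key=int)
    return {
        "disk_indexes": disk_indexes,
        "has_collection": (root / "collection").is_dir(),
        "has_deleted": (root / "deleted").is_file(),
        "has_manifest": (root / "manifest").is_file(),
    }
```

Calling `describe_repository("/path/to/repository")` returns a small dictionary that shows at a glance which of the four entries are present.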
II. DiskIndex
A. File list and descriptions

FileName             Description
directFile           Numeric representation of each document in the collection, useful for query expansion.
documentLengths      Length of each document in words, 4 bytes per document.
documentStatistics   Offset of each document in the directFile, document length in the directFile, document term length, number of unique terms in each document.
fieldsFile           Inverted extent list for fields.
frequentID           BulkTree storing the mapping from termID to termString and term statistics for frequent terms.
frequentString       BulkTree storing the mapping from termString to termID and term statistics for frequent terms.
frequentTerms        List of termID, termString pairs, not stored in a tree, used at index merge time.
infrequentID         BulkTree storing the mapping from termID to termString and term statistics for infrequent terms.
infrequentString     BulkTree storing the mapping from termString to termID and term statistics for infrequent terms.
invertedFile         Inverted lists for each term in the collection.
manifest             XML file storing important collection statistics and configuration information.
B. File format (see IndexWriter::write())

invertedFile
  invertedFile          <{TermInvDataOffset} TermStatistic, TermInvList>*
  TermStatistic         RVLDataLength(UINT32), [TermString(CString), TermData]
  TermData              TermTotalCount(UINT64), TermDocCount(UInt), TermMaxDocLength(UInt), TermMinDocLength(UInt), TermFieldStatistics
  TermFieldStatistics   <TermFieldTotalCount(UINT64), TermFieldElementCount(UInt)>FieldCount
  TermInvList           ControlByte(Byte), <TopDocs>?, <BatchData>*
  TopDocs               TopDocsCount(Int), <DocID(DOCID_T), PositionCount(Int), DocLength(Int)>TopDocsCount
  BatchData             NextBatchDocID(DOCID_T), RVLDataLength(Int), [<DocEntry>*]
  DocEntry              DocID(Int), PositionCount(Int), <TermPosition(Int)>PositionCount
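To make the DocEntry layout concrete, here is a minimal decoding sketch. The format above does not spell out how RVL integers are encoded, so the scheme used here is an assumption: 7 payload bits per byte, least-significant group first, with the high bit set on the last byte of each integer. The sketch also assumes that DocID and TermPosition values are stored as plain numbers; any delta coding in the real files is not modelled. All function names are hypothetical.

```python
def read_rvl_int(buf, pos):
    """Decode one variable-length integer starting at buf[pos].

    ASSUMPTION: 7 payload bits per byte, low-order group first,
    high bit set on the final byte of each integer.
    Returns (value, position of the next unread byte).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # terminator bit: this was the last byte
            return value, pos
        shift += 7


def write_rvl_int(value):
    """Encode one non-negative integer with the same assumed scheme."""
    out = bytearray()
    while value >= 0x80:
        out.append(value & 0x7F)
        value >>= 7
    out.append(value | 0x80)
    return bytes(out)


def read_doc_entries(buf):
    """Decode <DocEntry>*: DocID, PositionCount, then that many positions."""
    entries = []
    pos = 0
    while pos < len(buf):
        doc_id, pos = read_rvl_int(buf, pos)
        count, pos = read_rvl_int(buf, pos)
        positions = []
        for _ in range(count):
            term_position, pos = read_rvl_int(buf, pos)
            positions.append(term_position)
        entries.append((doc_id, positions))
    return entries
```

The buffer passed to `read_doc_entries` corresponds to the bracketed `[<DocEntry>*]` part of one BatchData record, that is, the RVLDataLength bytes that follow the batch header.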
frequentTerms, frequentID, frequentString, infrequentID, infrequentString
  frequentTerms        <DiskTermData1>*
  frequentID           Map<TermID(TERMID_T), DiskTermData2>
  frequentString       Map<TermString(CString), DiskTermData3>
  DiskTermData1        TermData, TermID(TERMID_T), TermString(CString), TermFilePointer
  DiskTermData2        TermData, TermString(CString), TermFilePointer
  DiskTermData3        TermData, TermID(TERMID_T), TermFilePointer
  TermFilePointer      TermInvDataOffset(UINT64), TermInvDataLength(UINT64)
  infrequentID         Map<TermID(TERMID_T), DiskTermData2>
  infrequentString     Map<TermString(CString), DiskTermData3>

fieldsFile
  fieldsFile           {FieldDataOffset} ControlByte(UINT8), <BatchData>*
  BatchData            NextBatchDocID(DOCID_T), RVLDataLength(Int), [<DocEntry>*]
  DocEntry             DocID(DOCID_T), ExtentCount(Int), <ExtentBegin(Int), ExtentLength(Int), ExtentOrdinal(Int)?, ExtentParent(Int)?, ExtentNumber(INT64)?>ExtentCount

directFile, documentLengths, documentStatistics
  directFile           <RVLDataLength(UINT32), {DocDirectDataOffset} [DocDirectData]>*
  DocDirectData        TermCount(Int), FieldCount(Int), <TermID(TERMID_T)>TermCount, <FieldExtent>FieldCount
  FieldExtent          FieldID(Int), FieldParentOrdinal(Int), FieldBegin(Int), FieldEnd(Int), FieldNumber(UINT64)
  documentLengths      <DocLength(UINT32)>*
  documentStatistics   <DocumentData>*
  DocumentData         DocDirectDataOffset(UINT64), DocDirectDataLength(Int), DocIndexedLength(Int), DocTotalLength(Int), DocUniqTermCount(Int)

manifest
  manifest             IndexType(CString), IndexBuildDate(CString), IndriDistribution(CString), CorpusStatistics, <FieldDescription>FieldCount
  CorpusStatistics     TotalDocCount(UINT64), TotalTermCount(UINT64), UniqTermCount(UINT64), DocBase(DOCID_T), FrequentTermCount(Int), MaxDocument(DOCID_T)
  FieldDescription     IsNumeric(Bool), IsOrdinal(Bool), IsParental(Bool), FieldName(CString), ParseName(String)?, TotalDocCount(UInt), TotalTermCount(UINT64), FieldDataOffset(UINT64)
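Because documentLengths and documentStatistics hold fixed-size records, they can be read by simple indexing. The sketch below illustrates this with Python's `struct` module. It rests on several assumptions that the format description does not confirm: little-endian byte order, 4-byte Int fields, no padding between fields, and records addressed by a zero-based index. The function names are hypothetical.

```python
import struct

# ASSUMPTIONS: little-endian, Int = 4 bytes, no padding between fields.
DOC_LENGTH = struct.Struct("<I")          # DocLength(UINT32)
DOCUMENT_DATA = struct.Struct("<Qiiii")   # one UINT64 followed by four Int

DOCUMENT_DATA_FIELDS = (
    "DocDirectDataOffset",
    "DocDirectDataLength",
    "DocIndexedLength",
    "DocTotalLength",
    "DocUniqTermCount",
)


def read_document_lengths(raw):
    """Return every DocLength stored in a documentLengths buffer."""
    return [length for (length,) in DOC_LENGTH.iter_unpack(raw)]


def read_document_data(raw, index):
    """Return the index-th DocumentData record (zero-based) as a dict."""
    values = DOCUMENT_DATA.unpack_from(raw, index * DOCUMENT_DATA.size)
    return dict(zip(DOCUMENT_DATA_FIELDS, values))
```

Under these assumptions one DocumentData record occupies 24 bytes, and its DocDirectDataOffset and DocDirectDataLength fields locate the matching DocDirectData block inside directFile.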
III. CompressedCollection
A. File list and descriptions

FileName          Description
lookup            Keyfile (B-Tree) that stores the mapping between a document ID and an offset into storage.
manifest          XML file storing configuration information for this CompressedCollection.
storage           Stores a compressed version of each document in the collection (here using the zlib compression library), along with byte offsets for each word in each document, and various document metadata added at index time.
forwardLookupn    Keyfile (B-Tree) that stores the mapping between a document ID and a metadata string.
reverseLookupn    Keyfile (B-Tree) that stores the mapping between a metadata string and one or more document IDs.
B. File format (see CompressedCollection::addDocument())

lookup
  lookup                Map<DocID(DOCID_T), StorageDocOffset(UINT64)>

manifest
  manifest              ForwardMetadataList, ReverseMetadataList
  ForwardMetadataList   <MetadataName(String)>*
  ReverseMetadataList   <MetadataName(String)>*

storage
  storage               <{StorageDocOffset} StorageDocData>*
  StorageDocData        <KeyValuePair>PairCount, <KeyOffset(UINT32), ValueOffset(UINT32)>PairCount, PairCount(UINT32)
  KeyValuePair          MetadataPair | TermPositionPair | TextPair | ContentPair | ContentLengthPair
  MetadataPair          MetadataKey(CString), MetadataValue(Void*)
  TermPositionPair      TermPositionFlag(CString), <TermBegin(Int), TermLength(Int)>*
  TextPair              TextFlag(CString), TextData(Void*)
  ContentPair           ContentFlag(CString), ContentOffset(Int)
  ContentLengthPair     ContentLengthFlag(CString), ContentLength(Int)

forwardLookupn, reverseLookupn
  forwardLookupn        Map<DocID(DOCID_T), MetadataValue(Void*)>
  reverseLookupn        Map<MetadataValue(CString), DocIDList(Void*)>
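A StorageDocData record keeps its PairCount at the very end, so a reader has to work backwards: read the count, then the offset table, then the pairs. The sketch below shows that order of operations on an already decompressed record. Several details are assumptions: little-endian byte order, offsets measured from the start of the record, and each value running up to the next key (or up to the offset table for the last pair). The helper names and the sample keys `docno` and `title` are hypothetical.

```python
import struct


def pack_storage_doc(pairs):
    """Build a StorageDocData record from (key, value) byte-string pairs."""
    body = bytearray()
    offsets = []
    for key, value in pairs:
        key_offset = len(body)
        body += key + b"\x00"              # keys are NUL-terminated C strings
        value_offset = len(body)
        body += value
        offsets.append((key_offset, value_offset))
    for key_offset, value_offset in offsets:
        body += struct.pack("<II", key_offset, value_offset)
    body += struct.pack("<I", len(pairs))  # PairCount is the last field
    return bytes(body)


def parse_storage_doc(record):
    """Recover the (key, value) pairs from a StorageDocData record."""
    (pair_count,) = struct.unpack_from("<I", record, len(record) - 4)
    table_start = len(record) - 4 - 8 * pair_count
    offsets = [struct.unpack_from("<II", record, table_start + 8 * i)
               for i in range(pair_count)]
    pairs = []
    for i, (key_offset, value_offset) in enumerate(offsets):
        key_end = record.index(b"\x00", key_offset)
        # ASSUMPTION: a value ends where the next key begins, or at the
        # offset table for the last pair.
        value_end = offsets[i + 1][0] if i + 1 < pair_count else table_start
        pairs.append((record[key_offset:key_end],
                      record[value_offset:value_end]))
    return pairs
```

For example, `parse_storage_doc(pack_storage_doc([(b"docno", b"D-1")]))` gives back the original pair. Reading a real storage file would additionally require seeking to the StorageDocOffset taken from lookup and undoing the zlib compression, neither of which is modelled here.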