Reputation: 4273
I have written an index with Lucene from a collection of documents. My documents have 2 fields and were added to the index like so:
Document doc = new Document();
doc.add(new TextField("Title", "I am a title", Field.Store.NO));
doc.add(new TextField("Text", "random text content", Field.Store.NO));
indexWriter.addDocument(doc);
I want to read the index and get the term frequency for every (term, doc) pair.
If I only had 1 field, let's say "Text", I would use the following code:
IndexReader indexReader = ...;
Terms terms = MultiFields.getTerms(indexReader, "Text"); // get all terms of this field
TermsEnum termsIterator = terms.iterator();
BytesRef term;
// For every term in the "Text" field:
while ((term = termsIterator.next()) != null) {
    String termString = term.utf8ToString(); // The term
    PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(indexReader,
            "Text", term, PostingsEnum.FREQS);
    int i;
    // For every doc which contains the current term in the "Text" field:
    while ((i = postingsEnum.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
        Document doc = indexReader.document(i); // The document
        int freq = postingsEnum.freq(); // Frequency of term in doc
    }
}
However, since I have 2 fields ("Title" and "Text"), to get the total term frequency for a (term, doc) pair I first need to get the frequency of every (term, doc) pair for the "Title" field and save them in memory, then get the frequency of every (term, doc) pair for the "Text" field, and finally combine them manually for each unique (term, doc) pair that was returned.
So this method is very likely to iterate through some (term, doc) pairs more than once, because the same (term, doc) pair can exist in both the "Title" and "Text" fields (if a document has the same term in its "Title" and its "Text").
Is there any way with the Lucene API to iterate through all fields combined instead, to avoid iterating through the same pairs more than once?
Upvotes: 2
Views: 702
Reputation: 1758
You have two fields, and for each document you need the frequency of every token as the sum of its per-field frequencies.
Keep in mind that BytesRef (like Integer) implements the Comparable interface: your stream of tokens (TermsEnum) and each associated stream of documents (PostingsEnum) are ordered.
So you have to merge two ordered streams twice: once for the terms of the two fields and, for each term that occurs in both, once for their documents. You don't have to keep more than the head of each stream in memory.
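The merge itself could look roughly like the sketch below. It is plain Java and deliberately not tied to Lucene: the two iterators are stand-ins for two ordered streams (for example the doc IDs and frequencies of one term in "Title" and in "Text"), and the names SortedFreqMerge, mergeSum and advance are made up for this illustration. It only assumes that both inputs are sorted ascending by key with no duplicate keys inside one stream.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class SortedFreqMerge {

    // Advances an iterator, returning null once it is exhausted.
    private static <T> T advance(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    // Merges two streams sorted ascending by key. A key present in both
    // streams is emitted once with the sum of its two values. Only the
    // current head of each stream is held at any time.
    public static <K extends Comparable<K>> void mergeSum(
            Iterator<Map.Entry<K, Integer>> left,
            Iterator<Map.Entry<K, Integer>> right,
            BiConsumer<K, Integer> sink) {
        Map.Entry<K, Integer> a = advance(left);
        Map.Entry<K, Integer> b = advance(right);
        while (a != null || b != null) {
            // An exhausted stream always loses the comparison.
            int cmp = (a == null) ? 1
                    : (b == null) ? -1
                    : a.getKey().compareTo(b.getKey());
            if (cmp < 0) {          // key only in the left stream
                sink.accept(a.getKey(), a.getValue());
                a = advance(left);
            } else if (cmp > 0) {   // key only in the right stream
                sink.accept(b.getKey(), b.getValue());
                b = advance(right);
            } else {                // key in both: sum the frequencies
                sink.accept(a.getKey(), a.getValue() + b.getValue());
                a = advance(left);
                b = advance(right);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical (docId, freq) postings of one term in each field.
        List<Map.Entry<Integer, Integer>> title =
                List.of(Map.entry(0, 1), Map.entry(3, 2));
        List<Map.Entry<Integer, Integer>> text =
                List.of(Map.entry(0, 4), Map.entry(1, 1), Map.entry(3, 1));
        mergeSum(title.iterator(), text.iterator(),
                (doc, freq) -> System.out.println("doc " + doc + ": " + freq));
    }
}
```

With Lucene you would apply the same loop at both levels: the outer merge walks the two TermsEnum streams by BytesRef, and for a term found in both fields the inner merge walks the two PostingsEnum streams by doc ID, adding up freq().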
Upvotes: 1