How to get the Term-Doc frequency from many fields combined?

Question

I have built an index with Lucene from a collection of documents. Each document has 2 fields and was added to the index like so:

Document doc = new Document();
doc.add(new TextField("Title", "I am a title", Field.Store.NO));
doc.add(new TextField("Text", "random text content", Field.Store.NO));
indexWriter.addDocument(doc);

I want to read the index and get the Term-Frequency for every (term, doc) pair.

If I only had 1 field, let's say "Text", I would use the following code:

IndexReader indexReader = ...;
Terms terms = MultiFields.getTerms(indexReader, "Text"); // get all terms of this field
TermsEnum termsIterator = terms.iterator();
BytesRef term;
// For every term in the "Text" Field:
while ((term = termsIterator.next()) != null) {
    String termString = term.utf8ToString(); // The term
    PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(indexReader,
        "Text", term, PostingsEnum.FREQS);
    int i;
    // For every doc which contains the current term in the "Text" field:
    while ((i = postingsEnum.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
        Document doc = indexReader.document(i); // The document
        int freq = postingsEnum.freq(); // Frequency of term in doc
    }
}

However, since I have 2 fields ("Title" and "Text"), getting the total term frequency for a (term, doc) pair means I first need to collect the frequency of every (term, doc) pair for the "Title" field and keep them in memory, then collect the frequencies for the "Text" field, and finally sum the two manually for each unique (term, doc) pair.
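To make the manual combination concrete, here is a minimal sketch of the accumulation step only. It uses plain Java collections and no Lucene calls; the class `CombinedTermFreq` and its methods `add` and `get` are hypothetical names chosen for illustration. It assumes `add` would be called once per posting from inside a per-field loop like the one above, passing `term.utf8ToString()`, the doc id, and `postingsEnum.freq()`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): sums per-field frequencies
// into one total per (term, doc) pair.
class CombinedTermFreq {
    // term -> (docId -> total frequency over all fields added so far)
    private final Map<String, Map<Integer, Integer>> totals = new HashMap<>();

    // Record one posting; repeated calls for the same (term, doc) pair
    // (for example once for "Title" and once for "Text") are summed.
    void add(String term, int docId, int freq) {
        totals.computeIfAbsent(term, t -> new HashMap<>())
              .merge(docId, freq, Integer::sum);
    }

    // Total frequency of the term in the doc, or 0 if the pair was never added.
    int get(String term, int docId) {
        Map<Integer, Integer> perDoc = totals.get(term);
        if (perDoc == null) {
            return 0;
        }
        Integer freq = perDoc.get(docId);
        return freq == null ? 0 : freq;
    }
}
```

This still visits a (term, doc) pair once per field that contains it, and it keeps every pair in memory, which is exactly the cost I would like to avoid.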

So this approach is likely to visit the same (term, doc) pair more than once, because a pair can exist in both the "Title" and "Text" fields (if a document has the same term in its "Title" and its "Text").

Is there any way in the Lucene API to iterate through all fields combined instead (to avoid iterating through the same pairs more than once)?
