dimitris93

Reputation: 4273

How to get the Term-Doc frequency from many fields combined?

I have built an index with Lucene from a collection of documents. Each document has 2 fields and was added to the index like so:

Document doc = new Document();
doc.add(new TextField("Title", "I am a title", Field.Store.NO));
doc.add(new TextField("Text", "random text content", Field.Store.NO));
indexWriter.addDocument(doc);

I want to read the index and get the Term-Frequency for every (term, doc) pair.

If I only had 1 field, let's say "Text", I would use the following code:

IndexReader indexReader = ...;
Terms terms = MultiFields.getTerms(indexReader, "Text"); // get all terms of this field
TermsEnum termsIterator = terms.iterator();
BytesRef term;
// For every term in the "Text" Field:
while ((term = termsIterator.next()) != null) {
    String termString = term.utf8ToString(); // The term
    PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(indexReader,
        "Text", term, PostingsEnum.FREQS);
    int i;
    // For every doc which contains the current term in the "Text" field:
    while ((i = postingsEnum.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
        Document doc = indexReader.document(i); // The document
        int freq = postingsEnum.freq(); // Frequency of term in doc
    }
}

However, since I have 2 fields ("Title" and "Text"), getting the total term frequency for a (term, doc) pair takes more work. I first need to get the frequency of every (term, doc) pair for the "Title" field and keep them in memory, then do the same for the "Text" field, and finally combine the two results manually for each unique (term, doc) pair.

So this method is very likely to visit the same (term, doc) pair more than once, because a pair can exist in both fields (if a document has the same term in its "Title" and its "Text").

Is there any way with the Lucene API to iterate through all fields combined instead (to avoid iterating through the same pairs more than once)?

Upvotes: 2

Views: 702

Answers (1)

Karsten R.

Reputation: 1758

You have two fields, and for each document you need the frequency of every token, summed over both fields.

Keep in mind that BytesRef (and Integer) implement the Comparable interface: your stream of terms (TermsEnum) and each associated stream of documents (PostingsEnum) are ordered.

So you merge two ordered streams twice: once at the term level (the two TermsEnum streams), and once at the document level (the two PostingsEnum streams of a term that occurs in both fields). You don't have to keep more than the head of each stream in memory.
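A minimal sketch of that two-level merge, in plain Java without Lucene so it runs on its own. The sorted maps are only stand-ins I am assuming for illustration: the outer TreeMap plays the role of a field's TermsEnum and each inner TreeMap the role of a term's PostingsEnum (doc id to frequency). FieldMerge, mergeFields and mergePostings are hypothetical names, not Lucene API. With real Lucene you would advance the enums with next() and nextDoc() in the same pattern.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FieldMerge {
    // Stand-in for an exhausted or missing postings stream.
    private static final TreeMap<Integer, Integer> EMPTY = new TreeMap<>();

    // Returns the next element of a stream, or null when it is exhausted.
    private static <T> T next(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    // Outer merge: walk both ordered term streams, holding only one head per stream.
    static List<String> mergeFields(TreeMap<String, TreeMap<Integer, Integer>> title,
                                    TreeMap<String, TreeMap<Integer, Integer>> text) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, TreeMap<Integer, Integer>>> ia = title.entrySet().iterator();
        Iterator<Map.Entry<String, TreeMap<Integer, Integer>>> ib = text.entrySet().iterator();
        Map.Entry<String, TreeMap<Integer, Integer>> a = next(ia), b = next(ib);
        while (a != null || b != null) {
            // cmp < 0: term only in "title"; cmp > 0: only in "text"; 0: in both.
            int cmp = a == null ? 1 : b == null ? -1 : a.getKey().compareTo(b.getKey());
            String term = cmp <= 0 ? a.getKey() : b.getKey();
            mergePostings(term,
                    cmp <= 0 ? a.getValue() : EMPTY,
                    cmp >= 0 ? b.getValue() : EMPTY,
                    out);
            if (cmp <= 0) a = next(ia);
            if (cmp >= 0) b = next(ib);
        }
        return out;
    }

    // Inner merge: walk both ordered doc-id streams of one term and sum the frequencies.
    private static void mergePostings(String term,
                                      TreeMap<Integer, Integer> pa,
                                      TreeMap<Integer, Integer> pb,
                                      List<String> out) {
        Iterator<Map.Entry<Integer, Integer>> ia = pa.entrySet().iterator();
        Iterator<Map.Entry<Integer, Integer>> ib = pb.entrySet().iterator();
        Map.Entry<Integer, Integer> a = next(ia), b = next(ib);
        while (a != null || b != null) {
            int cmp = a == null ? 1 : b == null ? -1 : a.getKey().compareTo(b.getKey());
            int doc = cmp <= 0 ? a.getKey() : b.getKey();
            int freq = (cmp <= 0 ? a.getValue() : 0) + (cmp >= 0 ? b.getValue() : 0);
            out.add(term + " doc=" + doc + " freq=" + freq);
            if (cmp <= 0) a = next(ia);
            if (cmp >= 0) b = next(ib);
        }
    }

    public static void main(String[] args) {
        TreeMap<String, TreeMap<Integer, Integer>> title = new TreeMap<>();
        title.put("lucene", new TreeMap<>(Map.of(0, 1)));
        TreeMap<String, TreeMap<Integer, Integer>> text = new TreeMap<>();
        text.put("lucene", new TreeMap<>(Map.of(0, 2, 1, 1)));
        // Each (term, doc) pair is emitted exactly once, with the combined frequency.
        mergeFields(title, text).forEach(System.out::println);
    }
}
```

Every (term, doc) pair comes out exactly once, and nothing is buffered beyond the current head of each stream. One caveat about the stand-in: String.compareTo orders by UTF-16 code units while BytesRef compares bytes, so with real Lucene you should compare the BytesRef values directly.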

Upvotes: 1
