tommy

Reputation: 139

Lucene 4.10.2: calculate tf-idf for all terms in the index

I would like to calculate the term frequency and the inverse document frequency (tf-idf) for all terms in an index.

I couldn't find any example of how to do this with the latest Lucene (4.x.x).

Could you help me?

Upvotes: 3

Views: 1806

Answers (2)

tommy

Reputation: 139

// Assumes reader is an already opened IndexReader, as in the other answer.
int docnum = reader.numDocs();
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    if (field.equals("contents")) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        while (termsEnum.next() != null) {
            // double idf = similarity.idf(termsEnum.docFreq(), docnum);
            // Cast to double so the ratio is not truncated by int division.
            double idf = Math.log((double) docnum / termsEnum.docFreq()); // idf = log(N / df)
            System.out.println(field + ":" + termsEnum.term().utf8ToString()
                    + " fr = " + termsEnum.docFreq() + " idf=" + idf);
        }
    } else {
        System.out.println("fin");
    }
}

because idf(t, D) = log(N / |{d ∈ D : t ∈ d}|)

N: total number of documents in the corpus

|{d ∈ D : t ∈ d}|: number of documents in which the term t appears
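
One detail worth checking when coding this formula in Java: the document count and the document frequency are both ints, so the ratio should be computed in floating point before taking the logarithm. The following is only a minimal sketch outside Lucene; the class IdfSketch and its methods are hypothetical names used for illustration.

```java
// Hypothetical helper class for illustration; it is not part of Lucene.
class IdfSketch {
    // idf = ln(N / df), with the ratio computed in floating point.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Same formula with plain int division, to show the truncation.
    static double idfTruncated(int numDocs, int docFreq) {
        return Math.log(numDocs / docFreq);
    }

    public static void main(String[] args) {
        System.out.println(idf(10, 4));          // ln(2.5)
        System.out.println(idfTruncated(10, 4)); // ln(2), because 10 / 4 == 2 for ints
    }
}
```

With 10 documents and a term that appears in 4 of them, the two methods disagree, which is why the cast matters.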

Upvotes: 0

femtoRgon

Reputation: 33351

To iterate through the terms in the index, you'll want to use Fields and Terms. TermsEnum exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). You can use DefaultSimilarity.idf to perform the calculation for you, rather than rolling your own.

DefaultSimilarity similarity = new DefaultSimilarity();
int docnum = reader.numDocs();
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {
        double idf = similarity.idf(termsEnum.docFreq(), docnum);
        System.out.println("" + field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}

tf is only defined for a term with respect to a specific document, so I'm not quite sure what you are looking for there.
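
To illustrate why tf needs a document, here is a rough stand-alone sketch on a toy in-memory corpus. It does not use Lucene at all: the class TfIdfSketch, its methods, and the sample documents are all hypothetical, and it uses the plain log(N / df) form of idf, which is not necessarily what DefaultSimilarity computes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use Lucene.
class TfIdfSketch {
    // Raw term frequency: how often the term occurs in one document.
    static int tf(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // Document frequency: number of documents containing the term at least once.
    static int docFreq(String term, List<List<String>> docs) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    // tf-idf with idf = ln(N / df); returns 0 if the term occurs in no document.
    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        int df = docFreq(term, docs);
        if (df == 0) {
            return 0.0;
        }
        return tf(term, doc) * Math.log((double) docs.size() / df);
    }

    public static void main(String[] args) {
        // Made-up, already tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "terms"),
                Arrays.asList("terms", "enum"));
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("doc " + i + " lucene tf-idf = " + tfIdf("lucene", docs.get(i), docs));
        }
    }
}
```

The same term gets a different score in each document: idf is one number per term, while tf (and therefore tf-idf) is one number per term and document pair.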

Upvotes: 2
