Krt_Malta

Reputation: 9465

Get most frequent words from Lucene 4.0 index

I'm quite new to Lucene. I have a Lucene 4.0 index and would like to compute the n most frequent words in it to build a stopword list. I found posts covering this for earlier versions of Lucene, such as "Get highest frequency terms from Lucene index", but it seems reader.terms() has been deprecated in 4.0.

How could I achieve this using Lucene 4.0?

Thanks!

Upvotes: 1

Views: 2198

Answers (2)

bcoughlan

Reputation: 26617

Here is an example of using HighFreqTerms from the lucene-misc package.

The snippet below ranks terms by document frequency. Note that you can use HighFreqTerms.TotalTermFreqComparator instead if you want to rank by total term frequency:

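import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;

// reader: an open IndexReader; n: number of terms wanted; "text": the field to scan.
HighFreqTerms.DocFreqComparator cmp = new HighFreqTerms.DocFreqComparator();
TermStats[] highFreqTerms = HighFreqTerms.getHighFreqTerms(reader, n, "text", cmp);

List<String> terms = new ArrayList<>(highFreqTerms.length);
for (TermStats ts : highFreqTerms) {
    terms.add(ts.termtext.utf8ToString());
}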

Upvotes: 1

mindas

Reputation: 26703

You might want to check the article "New index statistics in Lucene 4.0" by Mike McCandless, one of the Lucene contributors. What you're looking for is probably TermsEnum.totalTermFreq(), which returns the total number of occurrences of the current term across all documents.
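Once you have a count per term, picking the n most frequent ones is a small bounded-heap problem. Here is a rough sketch of that selection step in plain Java. It assumes the counts have already been collected into a Map (for example by iterating the field's terms and reading totalTermFreq() for each one); the TopTerms class and topN method are hypothetical names for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not a Lucene class.
public class TopTerms {

    // Returns up to n terms, most frequent first.
    // Ties on count are broken alphabetically so the result is deterministic.
    public static List<String> topN(Map<String, Long> counts, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }

        // Min-heap: the head is always the weakest candidate kept so far
        // (lowest count; among equal counts, the alphabetically later term).
        Comparator<Map.Entry<String, Long>> weakestFirst = (a, b) -> {
            int c = Long.compare(a.getValue(), b.getValue());
            return c != 0 ? c : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(weakestFirst);

        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest so at most n entries remain
            }
        }

        // Polling yields weakest first, so reverse to get most frequent first.
        List<String> result = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

The heap never holds more than n + 1 entries, so this stays cheap even when the field has millions of distinct terms. For a stopword list you would typically call something like TopTerms.topN(counts, 100) and review the result by hand before using it.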

Upvotes: 1
