Reputation: 9465
I'm quite new to Lucene. I have a Lucene 4.0 index and would like to compute the n most frequent words in order to build a stopword list. I found posts covering this for earlier versions of Lucene, such as "Get highest frequency terms from Lucene index", but it seems reader.terms() has been deprecated in 4.0.
How could I achieve this using Lucene 4.0?
Thanks!
Upvotes: 1
Views: 2198
Reputation: 26617
Here is an example using HighFreqTerms from the lucene-misc package. It ranks terms by document frequency; use HighFreqTerms.TotalTermFreqComparator instead if you want to rank by total term frequency:
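// Needs lucene-misc: org.apache.lucene.misc.HighFreqTerms and org.apache.lucene.misc.TermStats
HighFreqTerms.DocFreqComparator cmp = new HighFreqTerms.DocFreqComparator();
// Top n terms of the "text" field, ranked by the comparator above
TermStats[] highFreqTerms = HighFreqTerms.getHighFreqTerms(reader, n, "text", cmp);
List<String> terms = new ArrayList<>(highFreqTerms.length);
for (TermStats ts : highFreqTerms) {
    // termtext is a BytesRef; decode it to a Java String
    terms.add(ts.termtext.utf8ToString());
}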
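To make the difference between the two rankings concrete, here is a small sketch in plain Java with no Lucene dependency. The class TermFrequencyDemo and its methods are hypothetical helpers written for this illustration, and the "documents" are just pre-tokenized lists. Document frequency counts how many documents contain a term, while total term frequency counts every occurrence.

```java
import java.util.*;

public class TermFrequencyDemo {
    // Document frequency: number of documents that contain the term at least once.
    static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so each document contributes at most 1 per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Total term frequency: number of occurrences across all documents.
    static Map<String, Long> totalTermFreq(List<List<String>> docs) {
        Map<String, Long> ttf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                ttf.merge(term, 1L, Long::sum);
            }
        }
        return ttf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "the", "the", "the", "cat"),
            Arrays.asList("cat", "sat"),
            Arrays.asList("cat"));
        System.out.println("docFreq       = " + docFreq(docs));
        System.out.println("totalTermFreq = " + totalTermFreq(docs));
    }
}
```

In this toy corpus "cat" leads by document frequency (it appears in all three documents), while "the" leads by total term frequency (four occurrences in a single document). For a stopword list, which ranking fits better depends on whether you care about terms that are spread across many documents or terms that are simply very numerous.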
Upvotes: 1
Reputation: 26703
You might want to read the article "New index statistics in Lucene 4.0" by Mike McCandless, one of the Lucene contributors. What you're looking for is probably TermsEnum.totalTermFreq().
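If you collect the per-term statistics yourself, picking the n most frequent terms is a bounded min-heap problem. The sketch below is plain Java and deliberately leaves out the Lucene wiring: the Map<String, Long> is an assumed stand-in for term/frequency pairs you would gather while iterating a field's terms and reading totalTermFreq() for each one. The class TopTerms and the method topN are hypothetical names for this illustration.

```java
import java.util.*;

public class TopTerms {
    // Returns the n terms with the highest frequency, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topN(Map<String, Long> freqs, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the weakest candidate (lowest frequency) sits at the head.
        Comparator<Map.Entry<String, Long>> weakestFirst = (a, b) -> {
            int c = Long.compare(a.getValue(), b.getValue());
            return c != 0 ? c : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Long> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // evict the current weakest, keeping at most n entries
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // heap drains weakest-first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> freqs = new HashMap<>();
        freqs.put("the", 4L);
        freqs.put("cat", 3L);
        freqs.put("a", 3L);
        freqs.put("sat", 1L);
        System.out.println(topN(freqs, 2));
    }
}
```

The heap never holds more than n entries, so memory stays small even when the field has millions of distinct terms.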
Upvotes: 1