snakile

Reputation: 54521

Finding the most common terms in my Solr collection

I need to identify potential stopwords in my Solr collection. Is it possible to find those terms which have the highest document frequency in my collection (or at least in a given shard)?

Upvotes: 0

Views: 445

Answers (2)

femtoRgon

Reputation: 33341

Yes. Lucene's HighFreqTerms utility returns the terms with the highest document frequency for a field, like:

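import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;

// reader is an already-open IndexReader over the index (or shard) to inspect
TermStats[] stats = HighFreqTerms.getHighFreqTerms(reader, 10, "myContentField", new HighFreqTerms.DocFreqComparator());
for (TermStats stat : stats) {
    System.out.println(stat.termtext.utf8ToString() + ",   docfreq:" + stat.docFreq);
    // Or whatever else you want to do with them...
}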

Luke also prominently displays the most common terms.

Upvotes: 1

tcao

Reputation: 431

Since you already have Solr set up, you can use the TermsComponent to get the term frequencies for any given field:

http://wiki.apache.org/solr/TermsComponent

If you have a default search field that is the destination of your copyField rules, running the TermsComponent against it should give you the frequencies across all of the fields copied into it.
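As a rough sketch of what such a request might look like, the snippet below only builds the URL for a TermsComponent query sorted by count. It assumes a `/terms` request handler is configured in your solrconfig.xml; the host, port, collection name (`mycollection`), and the class and method names are hypothetical placeholders, and the field name is borrowed from the other answer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TermsRequestSketch {

    // Builds a TermsComponent request URL.
    // baseUrl (host + collection) and field are placeholders supplied by the caller.
    static String buildTermsUrl(String baseUrl, String field, int limit) {
        String encodedField = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return baseUrl + "/terms"
                + "?terms.fl=" + encodedField   // field whose terms are listed
                + "&terms.limit=" + limit       // how many terms to return
                + "&terms.sort=count"           // highest count first
                + "&wt=json";                   // response format
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance and collection name.
        System.out.println(
                buildTermsUrl("http://localhost:8983/solr/mycollection", "myContentField", 10));
    }
}
```

Opening the printed URL in a browser or with curl should list the top terms for that field, provided the handler exists under that path in your setup.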

Upvotes: 0
