Reputation: 592
Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simpler way to do this in SOLR, or programatically in SOLRj?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
Upvotes: 1
Views: 686
Reputation: 111
Upvotes: 1
Reputation: 9789
Or you can use Luke which allows you to look inside the Lucene index.
Upvotes: 1