Hugo Zaragoza

Reputation: 592

SOLR: size of term dictionary and how to prune it

Two related questions:

Q1. I would like to find out the term dictionary size (in number of terms) of a core.

One thing I do know how to do is to list the file size of *.tim. For example:

> du -ch *.tim | tail -1
1,3G    total

But how can I convert this to number of terms? Even a rough estimate would suffice.

Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in SOLR, or programmatically in SolrJ?

More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.

Upvotes: 1

Views: 686

Answers (2)

marco

Reputation: 111

  1. You can get this information from the Schema Browser page by clicking "Load Term Info", from the Luke request handler (https://wiki.apache.org/solr/LukeRequestHandler), and also from the Stats Component (https://cwiki.apache.org/confluence/display/solr/The+Stats+Component).
  2. To prune the index, you could facet on the field to get the terms with low frequency. Then fetch the matching documents and update each one without these terms (this could be difficult because it depends on the analyzers and tokenizers of your field). Alternatively, you can use the Lucene libraries to open the index and do it programmatically.
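The first half of point 2 (finding the low-frequency terms with a facet) could look roughly like the sketch below. It assumes the standard `/select` handler, the default flat JSON facet layout (`[term, count, term, count, ...]`), and a hypothetical field named `body`. Note that facet counts are document counts, not total occurrences, and that faceting on a large tokenized field with `facet.limit=-1` can be expensive.

```python
from urllib.parse import urlencode


def facet_url(core_url, field):
    """Build a facet request listing every indexed term of `field` with its document count."""
    params = {
        "q": "*:*",
        "rows": 0,             # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,     # no cap on the number of terms returned
        "facet.mincount": 1,   # skip terms that match no document
        "wt": "json",
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def rare_terms(flat_counts, max_docs):
    """Return the terms that occur in at most `max_docs` documents.

    `flat_counts` is assumed to be the flat facet list: [term, count, term, count, ...].
    """
    terms = flat_counts[0::2]
    counts = flat_counts[1::2]
    return [term for term, count in zip(terms, counts) if count <= max_docs]


# Hypothetical facet output, standing in for
# response["facet_counts"]["facet_fields"]["body"]
sample = ["solr", 120, "lucene", 85, "zzyzx", 1, "qwxv", 2]
print(rare_terms(sample, max_docs=2))  # ['zzyzx', 'qwxv']
```

The resulting list is only the candidate set; rewriting the documents without those terms is still the hard part described above.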

Upvotes: 1

Alexandre Rafalovitch

Reputation: 9789

  1. You can check the number and distribution of your terms in the Admin UI, under the collection's Schema Browser screen. You need to click Load Term Info.

Or you can use Luke, which lets you look inside the Lucene index.

  2. It is not clear what you mean by 'remove'. If you want to avoid indexing these terms, you can, for example, add them to the stopwords in the analyzer chain.
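The term count from point 1 can also be fetched without the UI. A minimal sketch, assuming the core exposes the Luke request handler at `/admin/luke` and that its JSON response carries a per-field `distinct` count when a field is requested with `fl`; the core URL, the field name `body` and the sample response are made up for illustration.

```python
from urllib.parse import urlencode


def luke_url(core_url, field):
    """Build a Luke request handler URL asking for details of a single field."""
    params = {"fl": field, "wt": "json"}
    return core_url.rstrip("/") + "/admin/luke?" + urlencode(params)


def distinct_terms(luke_response):
    """Map each field of a parsed Luke response to its `distinct` term count, where present."""
    fields = luke_response.get("fields", {})
    return {
        name: info["distinct"]
        for name, info in fields.items()
        if "distinct" in info
    }


# Hypothetical, heavily trimmed response used in place of a live request
sample = {
    "fields": {
        "body": {"type": "text_general", "docs": 1000, "distinct": 48213},
        "id": {"type": "string"},
    }
}
print(luke_url("http://localhost:8983/solr/mycore", "body"))
print(distinct_terms(sample))  # {'body': 48213}
```

Against a real core, the JSON returned by that URL would be passed to `distinct_terms` after parsing it with `json.loads`.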

Upvotes: 1
