inverted_index

Reputation: 2437

How to count all tokens in a collection/index

I use Lucene 5.3.1 and have already indexed some documents. Now I'm trying to find a built-in function that returns the total number of tokens across the whole collection/index.

I know that I can iterate over all documents and sum their lengths. But my algorithms are already complex and this would increase the run time, so I'm trying to avoid that approach. I think Lucene may have an API for this...

I searched for this function (or anything similar), but I could not find any useful link.

So the question is: is there a built-in function that returns the number of ALL TOKENS in the collection (i.e. the whole index)? If not, is there another efficient approach?

Any help is appreciated, thanks.

Upvotes: 0

Views: 206

Answers (1)

inverted_index

Reputation: 2437

Eventually I found the solution.

I use CollectionStatistics in the following way:

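// requires: import org.apache.lucene.search.CollectionStatistics;
// "Body" is the name of the indexed field whose tokens are counted
CollectionStatistics collectionStats = indexSearcher.collectionStatistics("Body");
long tokenCount = collectionStats.sumTotalTermFreq();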

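The sumTotalTermFreq() method returns the total number of tokens in the given field ("Body" here) across the whole index. The value is fixed for the index and does not depend on any query. In Lucene 5.x it can return -1 if the codec does not store this statistic or the field omits term frequencies.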
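To illustrate why a sum of term frequencies equals the token count, here is a minimal toy sketch in plain Java. It is not Lucene code: the class TokenCountSketch and its methods are hypothetical names made up for this illustration, and whitespace splitting stands in for a real analyzer. It only models the idea that adding up, for every term, how often that term occurs in a field gives the number of tokens in that field.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names exist in Lucene.
class TokenCountSketch {

    // Maps each term to its total frequency across all documents of one field.
    // Assumption: tokens are separated by whitespace (no real analyzer).
    static Map<String, Long> totalTermFreqs(List<String> docs) {
        Map<String, Long> freqs = new HashMap<>();
        for (String doc : docs) {
            for (String token : doc.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    freqs.merge(token, 1L, Long::sum);
                }
            }
        }
        return freqs;
    }

    // Adds up the per-term totals; the result is the token count of the field.
    static long sumTotalTermFreq(Map<String, Long> freqs) {
        long sum = 0;
        for (long f : freqs.values()) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("to be or not to be", "be quick");
        Map<String, Long> freqs = totalTermFreqs(docs);
        System.out.println("tokens in field: " + sumTotalTermFreq(freqs));
    }
}
```

Because an index already keeps this kind of per-field total, reading it is a cheap lookup and avoids iterating over every document at query time.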

Upvotes: 1
