Doron Yaacoby

Reputation: 9760

Lucene.net: Out of memory when sorting

I have a fairly large Lucene.net index (created with the latest version - 2.9). It has ~1 billion documents. It takes ~70GB of HD space. Each document is very small, just two fields: a string and an integer.

I want to search by the string field and sort by the integer field. The problem is that I get an OutOfMemoryException when I run the query with a sort. The code looks something like this:

var sort = new Sort(new SortField("frequency", SortField.INT, false));
var topDocs = searcher.Search(query, null, 1, sort);

It doesn't matter which query I use; whenever I add the sort, it crashes. Here is the stack trace:

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at Lucene.Net.Search.FieldCacheImpl.IntCache.CreateValue(IndexReader reader, Entry entryKey)
at Lucene.Net.Search.FieldCacheImpl.Cache.Get(IndexReader reader, Entry key)
at Lucene.Net.Search.FieldCacheImpl.GetInts(IndexReader reader, String field, IntParser parser)
at Lucene.Net.Search.FieldCacheImpl.IntCache.CreateValue(IndexReader reader, Entry entryKey)
at Lucene.Net.Search.FieldCacheImpl.Cache.Get(IndexReader reader, Entry key)
at Lucene.Net.Search.FieldCacheImpl.GetInts(IndexReader reader, String field, IntParser parser)
at Lucene.Net.Search.FieldComparator.IntComparator.SetNextReader(IndexReader reader, Int32 docBase)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Collector collector)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Int32 nDocs, Sort sort, Boolean fillFields)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Int32 nDocs, Sort sort)
at Lucene.Net.Search.Searcher.Search(Query query, Filter filter, Int32 n, Sort sort)

I'm fairly new to Lucene. Looks like it is trying to cache a huge amount of data and runs out of memory.

Update: Indeed, it looks like Lucene attempts to create an array int[maxDoc], which is huge in my case. From the documentation:

Sorting uses caches of term values maintained by the internal HitQueue(s). The cache is static and contains an integer or float array of length IndexReader.maxDoc() for each field name for which a sort is performed. In other words, the size of the cache in bytes is: 4 * IndexReader.maxDoc() * (# of different fields actually used to sort)
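To put a number on that formula, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration and are not part of Lucene.net; the figures assume maxDoc is roughly 1 billion and a single sort field, as in this question.

```csharp
using System;

// Hypothetical helper, not a Lucene.net API: evaluates the documented
// estimate 4 * maxDoc * (number of distinct sort fields).
public static class SortCacheEstimate
{
    // An int or float cache entry takes 4 bytes per document.
    private const long BytesPerValue = 4;

    public static long Bytes(long maxDoc, int sortFields)
    {
        return BytesPerValue * maxDoc * sortFields;
    }

    // Converts a byte count to GiB (2^30 bytes) for readability.
    public static double GiB(long bytes)
    {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }
}
```

SortCacheEstimate.Bytes(1000000000L, 1) gives 4,000,000,000 bytes, about 3.73 GiB, for the one "frequency" field. A single array of that size is also above the 2 GB per-object limit that the .NET Framework applies by default, which would be consistent with the allocation failing no matter which query is run.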

Can I change this behavior somehow?

Upvotes: 1

Views: 1329

Answers (2)

Doron Yaacoby

Reputation: 9760

I ended up doing something different. Since I always want my results sorted this way, what I really need is to influence scoring. I rebuilt my index using Document.SetBoost() with the value of the integer field, so the score of each document is dominated by that value. Since the default Lucene behavior is to return the best-scoring documents, I got what I needed.

Upvotes: 1

L.B

Reputation: 116118

No, you cannot change this behavior. But since you are only interested in the top result, you can write a custom Collector and get the topmost result without sorting the whole result set (like finding the max in an integer array in O(n) time).
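The "max in O(n)" idea can be sketched without any Lucene types. The snippet below is only an illustration under assumptions: the (DocId, Frequency) pairs stand in for whatever a custom Collector would see per hit, and the TopOne class is a made-up name, not a Lucene.net API. How the Collector obtains each document's integer value is left out here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the body of a custom Collector:
// keeps only the best hit seen so far instead of sorting all hits.
public static class TopOne
{
    // Returns the hit with the largest Frequency, or null if there are no hits.
    // On ties, the first hit seen is kept.
    public static (int DocId, int Frequency)? Best(
        IEnumerable<(int DocId, int Frequency)> hits)
    {
        (int DocId, int Frequency)? best = null;
        foreach (var hit in hits)
        {
            if (!best.HasValue || hit.Frequency > best.Value.Frequency)
            {
                best = hit;
            }
        }
        return best;
    }
}
```

For the hits (1, 5), (2, 9), (3, 1) this returns document 2. It makes one pass and holds a single candidate in memory; flipping the comparison to `<` would give the smallest value instead.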

If you are interested in the top-n results, you can use a PriorityQueue. Here is another answer of mine showing how to use a PriorityQueue with a Collector.
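As a rough illustration of the top-n variant, here is a sketch that uses the generic PriorityQueue from System.Collections.Generic (.NET 6 or later) rather than Lucene's own PriorityQueue class. The TopN class and the (DocId, Frequency) pairs are hypothetical and only stand in for the hits a Collector would receive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a bounded min-heap keeps the n largest values
// seen so far, so memory use is O(n) regardless of the number of hits.
public static class TopN
{
    // Returns up to n hits with the largest Frequency, highest first.
    public static List<(int DocId, int Frequency)> Collect(
        IEnumerable<(int DocId, int Frequency)> hits, int n)
    {
        var result = new List<(int DocId, int Frequency)>();
        if (n <= 0) return result;

        // Min-heap keyed on Frequency: the root is the weakest kept hit.
        var heap = new PriorityQueue<(int DocId, int Frequency), int>();
        foreach (var hit in hits)
        {
            if (heap.Count < n)
            {
                heap.Enqueue(hit, hit.Frequency);
            }
            else if (heap.TryPeek(out _, out int smallest) && hit.Frequency > smallest)
            {
                // Insert the new hit and drop the current minimum in one step.
                heap.EnqueueDequeue(hit, hit.Frequency);
            }
        }

        // Dequeue yields ascending order; reverse to get highest first.
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse();
        return result;
    }
}
```

With the hits (1, 5), (2, 9), (3, 1), (4, 7) and n = 2, this returns documents 2 and 4, in that order. Each hit costs at most O(log n), so the whole pass is O(m log n) for m hits.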

Upvotes: 1
