Reputation: 321
Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and playing with all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (roughly 50% slower). I'm pretty much out of ideas at this point and was wondering if anyone else had suggestions on how to bring it up to speed. I'm not even looking for a big improvement over 3.x anymore; I'd be happy to just match it and stay on a current release of Lucene.
In order to confirm that none of the standard migration changes had a negative effect on performance, I ported my Lucene 4.x version back to Lucene 3.6.2 and kept the newer API rather than using the custom ParallelMultiSearcher and other deprecated methods/classes.
Performance in 3.6.2 is even faster than it was before the migration.
Since the optimizations and use of the newer Lucene API actually improved performance on 3.6.2, it doesn't make sense for this to be a problem with anything but Lucene. I just don't know what else I can change in my program to fix it.
We have one index that is broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x
The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
We have a fixed set of relatively simple queries that are populated with user input and executed - they're composed of multiple BooleanQueries, TermQueries, and TermRangeQueries. Some of them are nested, but only a single level right now (see the first sketch after this list).
We're not doing anything advanced with results - we just fetch the scores and the stored ID fields.
We're using MMapDirectories pointing to index files in a tmpfs. We played with the useUnmap "hack" since we don't open new Directories very often and got a nice boost from that (see the second sketch after this list).
We're using a single IndexSearcher for all queries
Our test machines have 94GB of RAM and 64 logical cores
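To make the query shape concrete, here's a sketch in the 4.x API - the field names and values are invented, but the structure (BooleanQueries over TermQuery/TermRangeQuery clauses, nested one level) matches what we actually run:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;

public class Queries {
  // Field names and values are made up; only the shape matches our real queries.
  public static Query build(String state, String zipLo, String zipHi) {
    BooleanQuery outer = new BooleanQuery();
    outer.add(new TermQuery(new Term("state", state)), Occur.MUST);
    outer.add(TermRangeQuery.newStringRange("zip", zipLo, zipHi, true, true), Occur.MUST);

    // One nested level: an inner BooleanQuery as a clause of the outer one.
    BooleanQuery inner = new BooleanQuery();
    inner.add(new TermQuery(new Term("name", "smith")), Occur.SHOULD);
    inner.add(new TermQuery(new Term("name", "smyth")), Occur.SHOULD);
    outer.add(inner, Occur.MUST);
    return outer;
  }
}
```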
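And this is roughly how each shard's Directory gets opened - the tmpfs path is illustrative:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.MMapDirectory;

public class Dirs {
  public static MMapDirectory openShard(File shardDir) throws IOException {
    // shardDir lives on a tmpfs mount, e.g. /mnt/tmpfs/index/shard-00 (illustrative)
    MMapDirectory dir = new MMapDirectory(shardDir);
    if (MMapDirectory.UNMAP_SUPPORTED) {
      // the "hack": unmap mapped buffers when inputs are closed; a nice win for
      // us because we open new Directories so rarely
      dir.setUseUnmap(true);
    }
    return dir;
  }
}
```

For context, the per-request flow looks like this: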
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework (see the sketch after this list)
4) Aggregation and other simple post-processing
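A stripped-down sketch of steps 2-4 - the searcher setup is shown further down, and the result-window size is made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class ParallelSearch {
  private static final ForkJoinPool POOL = new ForkJoinPool(); // sized to the machine by default

  // Runs the (up to 4) fully-populated queries in parallel and collects results.
  public static List<TopDocs> run(final IndexSearcher searcher, List<Query> queries) {
    List<RecursiveTask<TopDocs>> tasks = new ArrayList<RecursiveTask<TopDocs>>();
    for (final Query query : queries) {
      RecursiveTask<TopDocs> task = new RecursiveTask<TopDocs>() {
        @Override
        protected TopDocs compute() {
          try {
            return searcher.search(query, 100); // result window size is made up
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        }
      };
      tasks.add(task);
      POOL.execute(task); // step 3: execute in parallel
    }
    List<TopDocs> results = new ArrayList<TopDocs>();
    for (RecursiveTask<TopDocs> task : tasks) {
      results.add(task.join()); // step 4 aggregates/post-processes these
    }
    return results;
  }
}
```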
Indexes were recreated for the 4.x system, but the data is the same. We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web)
In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader (see the sketch after these notes)
In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests)
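Here's roughly how the 4.x searcher gets wired up - shard paths and thread-pool sizing are illustrative, and Dirs.openShard() is the helper shown earlier:

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherSetup {
  public static IndexSearcher open(File[] shardDirs) throws IOException {
    IndexReader[] readers = new IndexReader[shardDirs.length];
    for (int i = 0; i < shardDirs.length; i++) {
      // one DirectoryReader per shard, backed by the MMapDirectory shown earlier
      readers[i] = DirectoryReader.open(Dirs.openShard(shardDirs[i]));
    }
    // The MultiReader presents the 20 shards as one logical index; passing an
    // ExecutorService makes the single shared IndexSearcher fan work out over it.
    ExecutorService executor = Executors.newFixedThreadPool(64); // sizing is illustrative
    return new IndexSearcher(new MultiReader(readers), executor);
  }
}
```

The profiler snapshot below (method self times) comes from this setup under load.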
Method | Self Time (%) | Self Time (ms) | Self Time (CPU, ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0 <- this is just the TCP threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886
If there's any other information you could use that might help, please let me know.
Upvotes: 4
Views: 1867
Reputation: 321
For anyone who cares or is trying to do something similar (controlled parallelism within a query): the problem I had was that the IndexSearcher was creating a task per segment per shard rather than a task per shard - I had misread the javadoc.
I resolved the problem by using forceMerge(1) on my shards to limit the number of extra threads. In my use case this isn't a big deal since I don't currently use NRT search, but it still adds unnecessary complexity to the update + slave synchronization process, so I'm looking into ways to avoid the forceMerge.
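For reference, the workaround is just this (the analyzer choice doesn't matter here since nothing is being indexed):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeShard {
  public static void mergeToOneSegment(File shardDir) throws IOException {
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_42, new KeywordAnalyzer()); // analyzer is irrelevant here
    IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir), config);
    try {
      writer.forceMerge(1); // collapse the shard to one segment -> one search task per shard
    } finally {
      writer.close();
    }
  }
}
```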
As a quick fix, I'll probably just extend the IndexSearcher and have it spawn a thread per reader instead of a thread per segment, but the idea of a "virtual segment" was brought up in the Lucene mailing list. That would be a much better long-term fix.
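Here's an untested sketch of that quick fix. It relies on the slices() hook that 4.x's IndexSearcher exposes (by default it creates one LeafSlice, i.e. one task, per segment). Rather than literally one thread per reader, this version just caps the slice count at the shard count, which has the same effect of bounding the tasks spawned per query. One gotcha: slices() is invoked from IndexSearcher's constructor, so it can't read subclass instance fields:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class ShardSlicedSearcher extends IndexSearcher {
  // Must be a constant: slices() runs during the superclass constructor,
  // before any instance fields of this subclass are initialized.
  private static final int MAX_SLICES = 20; // e.g. one per shard

  public ShardSlicedSearcher(IndexReader reader, ExecutorService executor) {
    super(reader, executor);
  }

  @Override
  protected LeafSlice[] slices(List<AtomicReaderContext> leaves) {
    int numSlices = Math.min(MAX_SLICES, Math.max(1, leaves.size()));
    List<List<AtomicReaderContext>> groups = new ArrayList<List<AtomicReaderContext>>();
    for (int i = 0; i < numSlices; i++) {
      groups.add(new ArrayList<AtomicReaderContext>());
    }
    for (int i = 0; i < leaves.size(); i++) {
      groups.get(i % numSlices).add(leaves.get(i)); // round-robin segments into slices
    }
    LeafSlice[] slices = new LeafSlice[numSlices];
    for (int i = 0; i < numSlices; i++) {
      List<AtomicReaderContext> group = groups.get(i);
      slices[i] = new LeafSlice(group.toArray(new AtomicReaderContext[group.size()]));
    }
    return slices;
  }
}
```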
If you want more detail, you can follow the Lucene mailing list thread here: http://www.mail-archive.com/[email protected]/msg42961.html
Upvotes: 2