Bjorn

Reputation: 1110

Lucene performance impact of returning large result sets

Does anyone know the performance impact of letting Lucene (or Solr) return very long result sets instead of just the usual "top 10"? We would like to return all results (which can be around 100,000 documents) from a user search and then post-process the returned document IDs before returning the actual result.

Our current index contains about 10-20 million documents.
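For concreteness, this is roughly the kind of id-only collection we have in mind on the Lucene side. It is only a sketch against a recent Lucene API (SimpleCollector, ScoreMode); the class and field names are made up, and older Lucene versions have a slightly different Collector interface:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Gathers every matching docId without computing scores or touching stored
// fields, so the ids can be post-processed outside of Lucene.
public class AllIdsCollector extends SimpleCollector {
    private final List<Integer> ids = new ArrayList<>();
    private int docBase;

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        docBase = context.docBase;  // per-segment offset, added to get index-wide docIds
    }

    @Override
    public void collect(int doc) throws IOException {
        ids.add(docBase + doc);
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;  // we need every hit, but not its score
    }

    public List<Integer> ids() {
        return ids;
    }
}
```

The idea would be to pass this to IndexSearcher.search(query, collector) and post-process collector.ids() afterwards, instead of asking for 100,000 scored and stored documents.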

Upvotes: 3

Views: 2350

Answers (2)

Satyanarayana Kakollu

Reputation: 21

I was able to get 100,000 rows back in 2.5 seconds with 27 million documents indexed (each document is about 1 KB, roughly 600 bytes of which are text fields). The hardware is not ordinary: it has 128 GB of RAM. Solr's memory usage was about 50 GB resident and 106 GB virtual.

I started seeing performance degradation after going past 80 million documents, and I am currently investigating how to match the hardware to the problem. Hope that helps you.
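For reference, this is roughly how such an id-only bulk fetch can be issued through SolrJ. It is a sketch rather than my exact setup: the URL, collection name, and query are placeholders, and cursorMark deep paging requires Solr 4.7 or later (older versions have to use plain start/rows paging, which gets slow at large offsets):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class FetchAllIds {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("body:foo");
            q.setFields("id");                          // only fetch ids, skip other stored fields
            q.setRows(1000);                            // page size per round trip
            q.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark needs a stable sort on the unique key

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    String id = (String) doc.getFieldValue("id");
                    // post-process the id here
                }
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) {
                    break;                              // cursor did not advance: all results consumed
                }
                cursor = next;
            }
        }
    }
}
```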

Upvotes: 2

Xodarap

Reputation: 11849

As spraff said, the answer to any question of the form "will X be fast enough?" is: "it depends."

I would be concerned about:

  1. You'll trash your caches if these documents are large, especially if you have stored fields that you're retrieving.
  2. Because of #1, you'll have tons of disk IO, which is very slow.
  3. The work Lucene does grows with the number of returned documents. So even ignoring practical considerations like "disk is slower than RAM", it will be slower.

I don't know what your post-processing does, but it's possible that it could be accomplished with a custom scoring algorithm instead, as in the sketch below.
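For example, if your post-processing just re-orders results by some per-document value, something along these lines would keep that work inside Lucene so only the usual top 10 ever come back. This is a sketch against a recent Lucene API; FunctionScoreQuery and the "priority" docvalues field are illustrative, not a claim about your schema:

```java
import java.io.IOException;

import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class RescoreExample {
    // Multiplies the relevance score by a per-document numeric docvalues field
    // (hypothetically called "priority"), so the desired ordering is applied
    // inside Lucene and only the top 10 hits need to be returned.
    static TopDocs topTenRescored(IndexSearcher searcher, Query textQuery) throws IOException {
        Query rescored = FunctionScoreQuery.boostByValue(
                textQuery,
                DoubleValuesSource.fromDoubleField("priority"));
        return searcher.search(rescored, 10);
    }
}
```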

Of course, just because retrieving all matching documents will be slower doesn't mean it will be too slow to be useful. Some faceting implementations essentially visit every matching document, and they perform adequately for many people.

Upvotes: 2
