Bjorn

Reputation: 1110

Lucene performance impact of returning large result sets

Does anyone know the performance impact of letting Lucene (or Solr) return very long result sets instead of just the usual "top 10"? We would like to return all results (which can be around 100,000 documents) from a user search and then post-process the returned document IDs before returning the actual result.

Our current index contains about 10-20 million documents.
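For concreteness, this is roughly the kind of id-only collection we have in mind on the Lucene side. It is only a sketch against a recent Lucene API (SimpleCollector, ScoreMode); the class and field names are made up, and older Lucene versions have a slightly different Collector interface:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Gathers every matching docId without computing scores or touching stored
// fields, so the ids can be post-processed outside of Lucene.
public class AllIdsCollector extends SimpleCollector {
    private final List<Integer> ids = new ArrayList<>();
    private int docBase;

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        docBase = context.docBase;  // per-segment offset, added to get index-wide docIds
    }

    @Override
    public void collect(int doc) throws IOException {
        ids.add(docBase + doc);
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;  // we need every hit, but not its score
    }

    public List<Integer> ids() {
        return ids;
    }
}
```

The idea would be to pass this to IndexSearcher.search(query, collector) and post-process collector.ids() afterwards, instead of asking for 100,000 scored and stored documents.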

Upvotes: 3

Views: 2350

Answers (2)

Satyanarayana Kakollu

Reputation: 21

I was able to get 100,000 rows back in 2.5 seconds with 27 million documents indexed (each document is about 1 KB, roughly 600 bytes of which are text fields). The hardware is not ordinary: it has 128 GB of RAM. Solr's memory usage was about 50 GB resident and 106 GB virtual.

I started seeing performance degradation after going past 80 million documents, and I am currently investigating how to match the hardware to the problem. Hope that helps you.
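For reference, this is roughly how such an id-only bulk fetch can be issued through SolrJ. It is a sketch rather than my exact setup: the URL, collection name, and query are placeholders, and cursorMark deep paging requires Solr 4.7 or later (older versions have to use plain start/rows paging, which gets slow at large offsets):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class FetchAllIds {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("body:foo");
            q.setFields("id");                          // only fetch ids, skip other stored fields
            q.setRows(1000);                            // page size per round trip
            q.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark needs a stable sort on the unique key

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    String id = (String) doc.getFieldValue("id");
                    // post-process the id here
                }
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) {
                    break;                              // cursor did not advance: all results consumed
                }
                cursor = next;
            }
        }
    }
}
```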

Upvotes: 2

Xodarap

Reputation: 11849

As spraff said, the answer to any question of the form "will X be fast enough?" is: "it depends."

I would be concerned about:

  1. You'll trash your caches if these documents are large, especially if you have stored fields that you're retrieving.
  2. Because of #1, you'll have tons of disk IO, which is very slow.
  3. The work Lucene does grows with the number of returned documents. So even ignoring practical considerations like "disk is slower than RAM", it will be slower.

I don't know what your post-processing does, but it's possible that it could be accomplished with a custom scoring algorithm instead, as in the sketch below.
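For example, if your post-processing just re-orders results by some per-document value, something along these lines would keep that work inside Lucene so only the usual top 10 ever come back. This is a sketch against a recent Lucene API; FunctionScoreQuery and the "priority" docvalues field are illustrative, not a claim about your schema:

```java
import java.io.IOException;

import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class RescoreExample {
    // Multiplies the relevance score by a per-document numeric docvalues field
    // (hypothetically called "priority"), so the desired ordering is applied
    // inside Lucene and only the top 10 hits need to be returned.
    static TopDocs topTenRescored(IndexSearcher searcher, Query textQuery) throws IOException {
        Query rescored = FunctionScoreQuery.boostByValue(
                textQuery,
                DoubleValuesSource.fromDoubleField("priority"));
        return searcher.search(rescored, 10);
    }
}
```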

Of course, just because retrieving all matching documents will be slower doesn't mean it will be too slow to be useful. Some faceting implementations essentially visit every matching document, and they perform adequately for many people.

Upvotes: 2
