Josh Handel

Reputation: 1780

Lucene search taking too long

I'm using Lucene.net (2.9.2.2) on a (currently) 70 GB index. I can run a fairly complicated search and get all the matching document IDs back in 1–2 seconds, but actually loading all the hits (about 700 thousand in my test queries) takes 5+ minutes.

We aren't using Lucene for a UI; this is a datastore between processes. We have hundreds of millions of pre-cached data elements, and the part I am working on exports a few specific fields from each found document. (Ergo, pagination doesn't make sense, as this is an export between processes.)

My question is: what is the best way to retrieve all of the documents in a search result? Currently I am using a custom collector that does a get on each document (with a MapFieldSelector) as it's collecting. I've also tried iterating through the list after the collector has finished, but that was even worse.
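For reference, a minimal sketch of the collector approach described above, against the Lucene.net 2.9 `Collector` API. The field names and the in-memory results list are hypothetical placeholders, not the asker's actual code:

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Loads a subset of stored fields for each hit as it is collected.
// "field1" and "field2" are placeholder field names.
public class ExportCollector : Collector
{
    private IndexReader reader;
    private readonly FieldSelector selector =
        new MapFieldSelector(new[] { "field1", "field2" });

    public readonly List<Document> Docs = new List<Document>();

    public override void SetScorer(Scorer scorer)
    {
        // Scores aren't needed for an export.
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.reader = reader;
    }

    public override void Collect(int doc)
    {
        // Each call hits the stored-fields files on disk,
        // which is where the 5+ minutes is likely being spent.
        Docs.Add(reader.Document(doc, selector));
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}
```

Accumulating 700k `Document` objects in a list is itself costly; streaming each document straight to the export destination inside `Collect` would avoid holding them all in memory.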

I'm open to ideas :-).

Thanks in advance.

Upvotes: 3

Views: 1011

Answers (2)

Yuval F

Reputation: 20621

What fields do you need to search? What fields do you need to store? Lucene.net is probably not the most efficient way to store and retrieve the actual document texts. Your scenario suggests not storing anything, indexing the needed fields and returning a list of document ids. The documents themselves can be stored in an auxiliary database.

Upvotes: 2

Adrian Conlon

Reputation: 3941

Hmmm, given that things got even worse when your "get" code was moved outside the collector, it sounds like your problem is I/O related.

I'm almost dreading asking this given the size of your index, but have you tried:

  • Optimising the index
  • De-fragmenting your hard disk

If so, was there a noticeable effect on the rate at which documents are retrieved? BTW, I make it about 2,333 items/second, if my shaky maths is correct...

Also, for the subset of fields you're retrieving, are any of them amenable to compression? Or have you already experimented with compression?

As a related matter, what proportion of your index do 700 thousand items represent? It'd be interesting to get a feel for I/O throughput. You could probably work out the maximum theoretical data rate for your machine/hard-drive combination and see whether you're already close to the limit.
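A rough sanity check of the numbers above, taking the reported 700,000 hits in 5 minutes and an assumed (not stated in the question) average of 1 KB of stored fields per document:

```python
hits = 700_000
seconds = 5 * 60  # the reported 5+ minutes

rate = hits / seconds
print(f"{rate:.0f} items/second")  # ~2333, matching the figure above

# With a hypothetical ~1 KB of stored fields per document,
# the implied sustained read throughput would be:
bytes_per_doc = 1024
mb_per_sec = rate * bytes_per_doc / (1024 * 1024)
print(f"{mb_per_sec:.1f} MB/s")
```

A few MB/s is far below what even a single spinning disk can stream sequentially, which supports the suspicion that the time is going into random seeks rather than raw transfer.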

Upvotes: 0
