Luca Mastrostefano

Reputation: 3271

Efficient positional query on small documents with Lucene

I have a large dataset composed of billions of small documents (~200 characters per document). What is the most efficient way to execute a positional query and retrieve only the best three documents?

My idea is to avoid building a positional index over the whole dataset and querying it directly. Instead, I would build a positional index on the fly from the results of a simple boolean query, and then execute the positional query against it to get the best three documents that I need.

So, instead of: billions of docs -> build a positional index -> execute positional query -> get best three docs

I would like to do the following: billions of docs -> build a normal index -> execute boolean query -> get the best 250 (a deliberately high number) -> build an in-RAM positional index from those results -> execute positional query -> get best three docs.

I think that doing so would reduce search time at the cost of a small approximation. Is there any other or better way to do this?
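To make the two-stage idea concrete, here is a minimal plain-Java sketch of it. It does not use Lucene at all: `TwoStagePhraseSearch` and its methods are hypothetical names invented for illustration, the "normal index" is a simple term-to-document-id map without positions, and the candidate cap keeps the first matches in document-id order rather than the best-scoring ones, which a real boolean query would return.

```java
import java.util.*;

// Plain-Java sketch (no Lucene): stage 1 uses a non-positional inverted index,
// stage 2 checks the phrase only on the surviving candidates.
class TwoStagePhraseSearch {
    private final List<String[]> docs = new ArrayList<>();               // tokenized documents
    private final Map<String, List<Integer>> postings = new HashMap<>(); // term -> doc ids, no positions

    // Index one document and return its id.
    int add(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        int id = docs.size();
        docs.add(tokens);
        for (String t : new HashSet<>(Arrays.asList(tokens))) {
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(id);
        }
        return id;
    }

    // Stage 1: boolean AND over all terms, capped at maxCandidates.
    // Assumption: the cap keeps the lowest doc ids instead of the best-scoring docs.
    List<Integer> candidates(String[] terms, int maxCandidates) {
        Set<Integer> result = null;
        for (String t : terms) {
            List<Integer> ids = postings.getOrDefault(t, Collections.emptyList());
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        List<Integer> out = new ArrayList<>();
        if (result != null) {
            for (int id : result) {
                if (out.size() == maxCandidates) break;
                out.add(id);
            }
        }
        return out;
    }

    // Stage 2 scoring: count exact, in-order occurrences of the phrase in one document.
    static int phraseCount(String[] tokens, String[] phrase) {
        int count = 0;
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length && match; j++) {
                match = tokens[i + j].equals(phrase[j]);
            }
            if (match) count++;
        }
        return count;
    }

    // Full pipeline: boolean candidates -> positional check -> best topK doc ids.
    List<Integer> search(String phraseText, int maxCandidates, int topK) {
        String[] phrase = phraseText.toLowerCase().split("\\s+");
        List<Integer> hits = new ArrayList<>();
        Map<Integer, Integer> score = new HashMap<>();
        for (int id : candidates(phrase, maxCandidates)) {
            int c = phraseCount(docs.get(id), phrase);
            if (c > 0) {
                hits.add(id);
                score.put(id, c);
            }
        }
        // Highest phrase count first; the sort is stable, so ties stay in doc-id order.
        hits.sort((a, b) -> score.get(b) - score.get(a));
        return new ArrayList<>(hits.subList(0, Math.min(topK, hits.size())));
    }
}
```

Calling `search("quick brown", 250, 3)` mirrors the pipeline above: at most 250 candidates survive the boolean stage, and only those are checked for term order. The approximation is visible in the cap: a document that contains the phrase but falls outside the first 250 candidates is never examined.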

Upvotes: 0

Views: 295

Answers (2)

fatih

Reputation: 1395

I agree with femtoRgon. If the same terms recur across your positional queries, you could also consider caching the (sub-)results of those queries.

Let's say you use SpanQuery objects: you could write a CachingSpanQuery class yourself that stores the resulting Spans in some form. For efficiency, you would need a compressed representation of the position information.
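As a rough illustration of the caching idea, the sketch below keeps an LRU cache of term positions and stores them as delta-encoded, variable-length bytes. It is plain Java, not Lucene code: `PositionCache` is a hypothetical helper, the `loader` function stands in for whatever would actually compute the Spans, and keying the cache by term alone is a simplification (a real cache would probably key on term and segment or document).

```java
import java.io.ByteArrayOutputStream;
import java.util.*;
import java.util.function.Function;

// Hypothetical helper, not a Lucene class: an LRU cache of per-term positions,
// stored as delta-encoded variable-length bytes to keep the cache small.
class PositionCache {
    private final Map<String, byte[]> cache;
    int loads = 0; // how often the expensive loader ran (for illustration only)

    PositionCache(final int maxEntries) {
        // accessOrder=true makes the map evict the least recently used entry first.
        cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Encode sorted, non-negative positions as gaps: 7 bits per byte,
    // high bit set means "more bytes follow for this gap".
    static byte[] encode(int[] sortedPositions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int p : sortedPositions) {
            int gap = p - prev;
            prev = p;
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Reverse of encode: rebuild absolute positions from the stored gaps.
    static int[] decode(byte[] bytes) {
        List<Integer> positions = new ArrayList<>();
        int value = 0, shift = 0, prev = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
                continue;
            }
            prev += value;
            positions.add(prev);
            value = 0;
            shift = 0;
        }
        int[] result = new int[positions.size()];
        for (int i = 0; i < result.length; i++) result[i] = positions.get(i);
        return result;
    }

    // Return cached positions for a term, running the loader only on a cache miss.
    int[] positions(String term, Function<String, int[]> loader) {
        byte[] encoded = cache.get(term);
        if (encoded == null) {
            loads++;
            encoded = encode(loader.apply(term));
            cache.put(term, encoded);
        }
        return decode(encoded);
    }
}
```

Small gaps fit in a single byte, so positions that sit close together cost roughly one byte each instead of four. Whether the decode cost on every hit is worth the memory saved depends on how often the same terms actually recur.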

Upvotes: 0

femtoRgon

Reputation: 33341

I would try using a search filter. A TermsFilter might be adequate, but a QueryWrapperFilter should almost certainly work. Either can be wrapped in a CachingWrapperFilter if caching the filter's results would be beneficial.

When passed to your IndexSearcher.search call, the filter restricts the query to the documents it accepts.

Since you have included the tag, filter queries can be used in Solr as well, using the fq parameter.

Upvotes: 1
