Reputation: 1401
We are developing a search engine web application that will let users search the content of about 200 portals.
Our business partner takes care of maintaining and feeding a Solr/Lucene instance that does the workhorse job of indexing the data.
Our application queries Solr and presents the results in a human-friendly way. However, we are wondering how we could limit the number of queries, perhaps using some form of caching. The results could be cached for a few hours.
What we are wondering is: what would be a good strategy for caching the query results? Obviously we expect the queries to vary a lot... Does it make sense at all to do caching?
Is there a caching system that is particularly well suited to this use case? We are using Spring 3 for the development.
Upvotes: 2
Views: 1454
Reputation: 3683
I have found that caching the results or the rendered content outside Lucene works best: have an API search service that fronts a caching tier holding the results from the Lucene index.
If you separate the caching tier out, you can plug in whatever caching you want: distributed caching (Redis, Azure AppFabric, other cloud caching, etc.). You can also cache partial renderings of the web page (i.e., output caching in ASP.NET) or cache the API calls themselves using RESTful conventions. Things like cache warming or proactive caching (based on usage) then become easy to do with services.
Your application/index cache can then be reused across more tiers of your app instead of caching only at the index level. This all depends on whether your indexing updates are real-time, whether queries need data-level security per client/user ID, etc. As mentioned above, Solr already does some of this for you.
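A minimal sketch of that separation in Java (all class and method names here are illustrative, not an existing API): the search service consults a pluggable cache tier first and only falls through to Solr on a miss.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper around your actual Solr HTTP calls.
interface SolrClient {
    List<String> query(String q);
}

// The pluggable cache tier: swap in Redis, AppFabric, etc. behind this.
interface ResultCache {
    List<String> get(String key);
    void put(String key, List<String> results);
}

// Simple in-memory implementation; a distributed cache would replace it.
class InMemoryResultCache implements ResultCache {
    private final Map<String, List<String>> store =
            new ConcurrentHashMap<String, List<String>>();
    public List<String> get(String key) { return store.get(key); }
    public void put(String key, List<String> results) { store.put(key, results); }
}

// The API search service fronts the cache; only misses reach Solr.
class CachedSearchService {
    private final ResultCache cache;
    private final SolrClient solr;

    CachedSearchService(ResultCache cache, SolrClient solr) {
        this.cache = cache;
        this.solr = solr;
    }

    public List<String> search(String normalizedQuery) {
        List<String> hit = cache.get(normalizedQuery);
        if (hit != null) {
            return hit; // served entirely from the caching tier
        }
        List<String> results = solr.query(normalizedQuery);
        cache.put(normalizedQuery, results); // warm the cache for the next caller
        return results;
    }
}
```

Because the service only depends on the `ResultCache` interface, cache warming or proactive caching is just another component calling `put` ahead of user traffic.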
Upvotes: 0
Reputation: 27614
I would keep in mind that Solr already has a lot of caching built in to speed up common queries. I'd advise you to look into the inherent capabilities of Solr/Lucene before you go off and reinvent the wheel with your own query cache.
Here is a good place to start.
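For orientation, those built-in caches live in the `<query>` section of solrconfig.xml. A sketch with illustrative sizes (tune them to your own query load):

```xml
<!-- solrconfig.xml: sizes are illustrative only -->
<query>
  <!-- caches ordered lists of document IDs for recent queries -->
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <!-- caches unordered doc sets for filter queries (fq) -->
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <!-- caches stored fields fetched from the index -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
</query>
```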
Upvotes: 3
Reputation: 53156
The simplest solution is to normalize your query string before it hits Solr.
I created my own QueryBuilder method, through which I pass my query string before hitting Solr. All it does is explode the arguments and then sort them into a predefined group set.
For example, to make your queries cacheable you can sort the keys alphabetically, reform the query string from the sorted keys, and then use that to query Solr. (The actual query result will be unchanged.) Two requests that differ only in parameter order then map to the same cache entry, as in the sketch below.
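A minimal Java sketch of that normalization, assuming the query arrives as a flat map of parameters (the class name is illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

public class QueryNormalizer {

    // Sorting by key means equivalent queries always produce the same
    // string, and therefore the same cache key. Values should already be
    // URL-encoded before this step in a real query string.
    public static String normalize(Map<String, String> params) {
        Map<String, String> sorted = new TreeMap<String, String>(params); // alphabetical key order
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

With this, `rows=10&q=spring` and `q=spring&rows=10` both normalize to `q=spring&rows=10` and share one cache entry.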
Before you actually run the query, you can create a hash of the normalized Solr query string and check it against an in-memory set of the keys you have already cached. If you find yourself approaching millions of query keys, which is quite likely, you might want to start looking at a Bloom filter to reduce the keyspace while still maintaining some degree of accuracy on cache hits.
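A sketch of that key check, assuming Guava's `BloomFilter` is on the classpath (the sizing numbers and class name are illustrative):

```java
import java.nio.charset.Charset;
import java.security.MessageDigest;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class CacheKeys {

    // Sized for ~1M keys at a 1% false-positive rate (illustrative).
    // "Might contain" still needs a real cache lookup to confirm;
    // "definitely not" lets you skip the lookup and go straight to Solr.
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(Charset.forName("UTF-8")), 1000000, 0.01);

    // SHA-1 hash of the normalized query string, used as the cache key.
    public static String key(String normalizedQuery) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(normalizedQuery.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public boolean mightBeCached(String key) {
        return seen.mightContain(key);
    }

    public void markCached(String key) {
        seen.put(key);
    }
}
```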
Alternatively, you might want to look at putting a reverse proxy cache between you and Solr. For example, if you were to query Solr as Spring -> Varnish -> Solr, Varnish could do the caching, using the query string as the hash key. You would then set a two-hour Expires/TTL so that the results are automatically flushed/invalidated.
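A minimal sketch of that Varnish setup (Varnish 3-era VCL; the host/port are illustrative, and in Varnish 4+ `vcl_fetch` became `vcl_backend_response`):

```
# Point Varnish at the Solr backend.
backend solr {
    .host = "127.0.0.1";
    .port = "8983";
}

sub vcl_fetch {
    # Cache every Solr response for two hours, matching the
    # "results can be a few hours stale" requirement.
    set beresp.ttl = 2h;
    return (deliver);
}
```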
Hopefully this helps.
Upvotes: 0