Limits of django-haystack with Solr 4.0

Question

We are planning to use django-haystack with Solr4.0 (with near real time search) for our web app, and I was wondering if anyone could advice on the limits of using haystack (when compared to using solr directly). i.e is there a performance hit/overhead of using django-haystack? We have around 3 million+ documents which would need indexing + an additional (estimated) 100k added everyday.

Ideally, I'd think we need a simple API over Solr4 - but I am finding it hard to find anything specific to python which is still actively maintained (except django-haystack ofcourse). I'd appreciate any guidance on this.

Foobie Bletch · Accepted Answer

It seems like your question could be rephrased "How has Haystack burned you?" Haystack is nice for some things, but has also caused me some headaches on the job. Here are some things I've had to code around.

You mentioned indexing. Haystack has management commands for rebuilding the index. We'll use these for nuke and pave rebuilding during testing, but for reindexing our production content we kind of hit the wall. The command would take forever, you wouldn't know where it was in terms of progress, and if it failed you were screwed and had to start all over again. We reached a point where we had too much content and it would fail reliably enough. We switched to making batches of content and reindexing these in celery tasks. We made a management command to do the batching and kick off all those tasks. This has been a little more robust in the face of partial failures and, even better, it actually finishes.The underlying task will use haystack calls to convert a database object to a solr document -- this ORMiness is nice and hasn't burned me yet. We edit our solr schema by hand, though.

The query API is okay for simple stuff. If you are doing more convoluted solr queries, you can find yourself just feeding in raw queries. This can lead to spaghetti code. We eventually pushed that raw spaghetti into request handlers in the solrconfig file instead. We still use haystack to turn on highlighting or to choose a specific request handler, but again, I'm happier when we keep it simple and we did hack on a method to add arbitrary parameters as needed.

There are other assumptions about how you want to use solr that seem to get baked in to the code. Haystack is the only open source project where I actually have some familiarity with the code. I'm not sure if that's a good thing, since it hasn't always been by choice. We have plenty of layer cake code that extends a haystack class and overrides it to do the right thing. This isn't terrible, but when you have to also copy and paste haystack code into those overridden methods then it starts being more terrible.

So... it's not perfect, but parts are handy. If you are on the fence about writing your own API anyway, using haystack might save you some trouble, especially when you want to take all your solr results and pass them back into django templates or whatever. It sounds like with the steady influx of documents, you will want to write some batch indexing jobs anyway. Start with that, prepare to be burnt a little and look at the source when that happens, it's actually pretty readable.

Limits of django-haystack with Solr 4.0

Answers (1)

Related Questions