Pixl8ed

Reputation: 21

What's the simplest way to process lots of updates to Solr in batches?

I have a Rails app that uses Sunspot, and it generates a high volume of individual updates that put unnecessary load on Solr. What is the best way to send these updates to Solr in batches?

Upvotes: 2

Views: 2524

Answers (3)

Jared

Reputation: 3005

I used a slightly different approach here:

I was already using auto_index: false and processing Solr updates in the background with Sidekiq. So instead of building an additional queue, I used the sidekiq-grouping gem to combine Solr update jobs into batches, as sketched below. Then I use Sunspot.index in the job to index the grouped objects in a single request.
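For illustration, a minimal sketch of that setup, assuming a hypothetical SolrIndexWorker enqueued with a class name and id; the queue name and flush thresholds are assumptions as well. With sidekiq-grouping enabled, perform receives an array of the batched argument lists:

class SolrIndexWorker
  include Sidekiq::Worker

  # sidekiq-grouping options: flush once 100 jobs accumulate,
  # or every 30 seconds, whichever comes first
  sidekiq_options queue: :solr, batch_flush_size: 100, batch_flush_interval: 30

  # grouped_args is the batched argument lists, e.g. [["Post", 1], ["Post", 7]]
  def perform(grouped_args)
    records = grouped_args.map do |class_name, id|
      class_name.constantize.find_by(id: id)
    end.compact

    Sunspot.index(records) # one update request for the whole batch
    Sunspot.commit
  end
end

Individual updates are then enqueued one at a time, e.g. from an after_save callback: SolrIndexWorker.perform_async(record.class.name, record.id).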

Upvotes: 0

Nick Zadrozny

Reputation: 7944

Sunspot makes indexing a batch of documents pretty straightforward:

Sunspot.index(array_of_docs)

That will send off just the kind of batch update to Solr that you're looking for here.

The trick for your Rails app is finding the right scope for those batches of documents. Are they being created as the result of a bunch of user requests, and scattered all around your different application processes? Or do you have some batch process of your own that you control?

The sunspot_index_queue project on GitHub looks like a reasonable approach to this.

Alternatively, you can always turn off Sunspot's "auto-index" option, which fires off updates whenever your documents are updated. In your model, you can pass in auto_index: false to the searchable method.

searchable auto_index: false do
  # sunspot setup
end

Then you have a bit more freedom to control indexing in batches. You might write a standalone Rake task that iterates through all objects created or updated in the last N minutes and indexes them in batches of 1,000 docs or so, as sketched below. An infinite loop of that should stand up to a pretty solid stream of updates.
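A minimal sketch of such a task; the Post model, the five-minute window, the batch size, and the poll interval are all assumptions to adjust for your app:

# lib/tasks/solr_batch_index.rake
namespace :solr do
  desc "Index records touched in the last few minutes, in batches of 1,000"
  task batch_index: :environment do
    loop do
      Post.where("updated_at > ?", 5.minutes.ago)
          .find_in_batches(batch_size: 1_000) do |batch|
        Sunspot.index(batch) # one batched update per 1,000 docs
      end
      Sunspot.commit
      sleep 60 # poll interval; tune to your update volume
    end
  end
end

Overlapping windows are harmless here, since re-indexing a document is idempotent.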

At really large scale, you want all your updates going through some kind of queue. Inserting your document data into a queue like Kafka or AWS Kinesis, for later processing in batches by another standalone indexing process, would be ideal at that scale.
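One possible shape of that pipeline, sketched with the ruby-kafka gem; the topic name, group id, and message format are assumptions:

require "kafka" # ruby-kafka gem
require "json"

kafka = Kafka.new(["localhost:9092"])

# Producer side, inside the Rails app: publish a record reference
# instead of indexing inline.
kafka.deliver_message({ class: "Post", id: 42 }.to_json, topic: "solr-updates")

# Consumer side, a standalone indexing process: drain messages in
# batches and issue one Sunspot.index call per batch.
consumer = kafka.consumer(group_id: "solr-indexer")
consumer.subscribe("solr-updates")
consumer.each_batch do |batch|
  records = batch.messages.map do |message|
    ref = JSON.parse(message.value)
    ref["class"].constantize.find_by(id: ref["id"])
  end.compact
  Sunspot.index(records)
  Sunspot.commit
end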

Upvotes: 1

Jayendra

Reputation: 52779

Assuming the changes from the Rails app also update a persistence store, you can look at Solr's Data Import Handler (DIH), which can be scheduled to update the Solr indexes periodically.
So instead of an update and commit being triggered on Solr for every change, you can choose the frequency at which Solr is updated in batches.
However, expect some latency in the search results.

Also, are you committing after each individual record update? If you are using Solr 4.0, you can look into soft and hard commits as well.
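For reference, commit behavior lives in solrconfig.xml; a common arrangement is a frequent soft commit for visibility plus an infrequent hard commit for durability (the intervals below are illustrative, not a recommendation):

<!-- hard commit: flush to disk, but don't reopen searchers -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit: make new documents searchable quickly -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>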

Upvotes: 2
