Queueing mechanism and Elasticsearch 1.4.0

Question

I have a RabbitMQ broker, on which I post different messages that will end up as documents in Elasticsearch. There are multiple consumers from the broker, which are actually different threads in a task executor assigned to an amqp inbound gateway (using spring integration and spring amqp here).

Think at the following scenario: I have created a doc in ES with the structure

{
   "field1" : "value1",
   "field2" : "value2"
}

Afterwards I send two update requests, both updating the same field, let's say field1. If I send this messages one right after another(common use case in production), my consumer threads will fetch the messages in the right order(amqp allows this), but the processing could happen in the wrong order and the later updated value could be overwritten by the first one. I will end up having wring data.

How can I make sure my data won't get corrupted? =>Having 1 single consumer thread is not enough, because if I want to scale out by adding more machines with my consuming app, I will still end up having multiple consumers. I might need ordering of messages, but having multiple machines I will probably need to create some sort of a cluster aware component, I am using SI, so this seems really hard to do in my opinion.

In pre 1.2 versions of ES, we used an external version, like a timestamp, and ES would have thrown VersionConflictException in my scenario:first update would have had version 10000 let's say, the second 10001 and if the first would have been processed first, ES would reject the request with version 10000 as it's lower than the existing one. But from the latest versions, ES guys have removed this functionality for update operations.

Gary Russell · Accepted Answer

One solution might be to use multiple queues and have a single consumer on each queue; use a hash function to always route updates to the same document to the same queue see the RabbitMQ Tutorials for the various options.

You can scale out by adding more queues (and changing your hash function).

For resiliency, consider running your consumers in Spring XD. You can have a single instance of each rabbit source (for each queue) and XD will take care of failing it over to another container node if it goes down.

Otherwise you could roll your own by having a warm standby - inbound adapters configured with auto-startup="false" and have something monitor and use a to start a new instance if the active one goes down.

EDIT:

In response to the fourth comment below.

As I said above, to scale out, you would have to change the hash function. So adding consumers automatically while running would be tricky.

You don't have to hard-code the queue names in the jar, you can use a property placeholder and fill it from properties, system properties, or an environment variable.

This solution is the simplest but does have these limitations.

You could, however, build a management app that could scale it out - stop the producer, wait for all queues to quiesce, reconfigure the consumers and restart the producer - Spring Integration provides a to start/stop adapters; you can also do it via JMX.

Alternative solutions are possible but will generally require maintaining some shared state across a cluster (perhaps using zookeeper etc), so are much more complex; and you still have to deal with race conditions (where the second update might arrive at some consumer before the first).

Queueing mechanism and Elasticsearch 1.4.0

Answers (2)

Related Questions