Reputation: 291
I have a Kafka topic and a Spark application. The Spark application gets data from the Kafka topic, pre-aggregates it, and stores it in Elasticsearch. Sounds simple, right?
Everything works fine as expected, but the minute I set the "spark.cores" property to something other than 1, I start getting
version conflict, current version [2] is different than the one provided [1]
After researching a bit, I think the error occurs because multiple cores can hold the same document at the same time. Thus, when one core finishes its part of the aggregation and tries to write back to the document, it gets this error.
TBH, I am a bit surprised by this behaviour because I thought Spark and ES would handle this on their own. This leads me to believe that maybe, there is something wrong with my approach.
How can I fix this? Is there some sort of "synchronized" or "lock" sort of concept that I need to follow?
Cheers!
Upvotes: 2
Views: 8195
Reputation: 8204
It sounds like you have several messages in the queue that all update the same ES document, and these messages are being processed concurrently. There are two possible solutions:
First, you can use Kafka partitions to ensure that all the messages that update the same ES document are handled in sequence. This assumes that there’s some property in your message that Kafka can use to determine how messages map to ES documents.
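The idea is that Kafka routes all messages with the same key to the same partition, and a partition is consumed in order. Here is a minimal sketch of key-based partition assignment; the real Kafka client uses murmur2 hashing, so CRC32 below is a stand-in to illustrate the principle, not the exact algorithm:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner hashes the message key so that all
    # messages with the same key land in the same partition and are
    # therefore consumed sequentially. (Real clients use murmur2;
    # CRC32 here is only an illustrative stand-in.)
    return zlib.crc32(key) % num_partitions

# If the key is the ES document id, concurrent updates to the same
# document can no longer race across partitions.
p1 = partition_for(b"doc-42", 8)
p2 = partition_for(b"doc-42", 8)
```

Since `p1 == p2` always holds for the same key, two updates to `doc-42` can never be processed in parallel by consumers of different partitions.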
The other way is the standard way of handling optimistic concurrency conflicts: retry the transaction. If you have some data from a Kafka message that you need to add to an ES document, and the current document in ES is version 1, then you can try to update it and save back version 2. But if someone else already wrote version 2, you can retry by using version 2 as a starting point, adding your new data, and saving version 3.
If either of these approaches destroys the concurrency you were expecting to get from Kafka and Spark, then you may need to rethink your approach. You may have to introduce a new processing stage that does some heavy lifting but doesn’t actually write to ES, then do the ES updates in a separate step.
Upvotes: 5
Reputation: 291
I would like to answer my own question. In my use case, I was updating a document counter. So all I had to do was retry whenever a conflict arose, because I just needed to aggregate my counter.
My use case was somewhat this:
For many uses of partial update, it doesn’t matter that a document has been changed. For instance, if two processes are both incrementing the page-view counter, it doesn’t matter in which order it happens; if a conflict occurs, the only thing we need to do is reattempt the update.
This can be done automatically by setting the retry_on_conflict parameter to the number of times that update should retry before failing; it defaults to 0.
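If you write from Spark through the elasticsearch-hadoop connector, the equivalent knob is the `es.update.retry.on.conflict` setting. A minimal configuration fragment might look like this (the value 3 is an arbitrary example; tune it for your workload):

```
es.write.operation = update
es.update.retry.on.conflict = 3
```

With this set, the connector retries a conflicting partial update instead of failing the task.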
Thanks to Willis and this blog, I was able to configure the Elasticsearch settings, and now I am not having any problems at all.
Upvotes: 2