Harshit

Reputation: 1565

Understanding efficient use of Point in Time (PIT) with search_after in Elasticsearch

I am currently trying to solve a problem statement. We have an index from which I want to fetch all documents and apply an update to a few fields on each. This requires paginating across documents during the fetch process.

Context - The architecture is as follows: a daily cronjob writes to all documents (a full update) once a day. Let's say it takes 2 hours to complete. During this window, multiple Kafka consumers also perform writes on the same index (full updates) with different data. We do full updates because our schema uses nested fields.

To solve the problem statement mentioned above, we could use the Scroll API to fetch all documents, apply the update, and bulk index the results. However, I want to take care of 2 things -

  1. Conflict resolution - My cronjob could be writing to the same document that one of my Kafka consumers is writing to. In this case, I would like an optimistic locking approach so that I can retry applying the partial update to the document. The issue is that the metadata required for optimistic locking, i.e. _primary_term and _seq_no, is not included in the Scroll API response. This is where PIT with search_after becomes useful.
  2. Avoid applying stale data to the index - With PIT + search_after, I want to understand what happens when a snapshot of the index is taken and kept alive for "1m" per paginated request while writes/deletes from the Kafka consumers happen simultaneously - how do things work here? I am not sure whether I would end up applying a full update to a document using stale data. I always want to keep the latest data.
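To make the intended flow concrete, here is a minimal runnable sketch of the retry-on-conflict loop from point 1. The real Elasticsearch calls (PIT, search_after, conditional update) are shown as comments assuming the elasticsearch-py 8.x client; `FakeIndex`, `ConflictError`, and all document IDs below are hypothetical stand-ins so the conflict semantics can be exercised without a live cluster.

```python
# Real calls (elasticsearch-py 8.x), for reference:
#   pit = es.open_point_in_time(index="my-index", keep_alive="1m")
#   page = es.search(pit={"id": pit["id"], "keep_alive": "1m"},
#                    sort=[{"_shard_doc": "asc"}],
#                    seq_no_primary_term=True,          # hits carry _seq_no/_primary_term
#                    search_after=last_sort_values, size=1000)
#   es.update(index="my-index", id=doc_id, doc=partial,
#             if_seq_no=seq_no, if_primary_term=primary_term)  # 409 on conflict
# Below, FakeIndex stands in for the cluster so the retry loop is runnable here.

class ConflictError(Exception):
    """Stand-in for the HTTP 409 raised when if_seq_no/if_primary_term no longer match."""

class FakeIndex:
    def __init__(self):
        # id -> {"source": dict, "seq_no": int, "primary_term": int}
        self._docs = {}

    def index(self, doc_id, source):
        """Full (re)index of a document; every write bumps _seq_no."""
        entry = self._docs.get(doc_id)
        seq_no = 0 if entry is None else entry["seq_no"] + 1
        self._docs[doc_id] = {"source": dict(source), "seq_no": seq_no, "primary_term": 1}

    def get(self, doc_id):
        """Fresh read returning the concurrency-control metadata."""
        e = self._docs[doc_id]
        return {"_source": dict(e["source"]),
                "_seq_no": e["seq_no"], "_primary_term": e["primary_term"]}

    def update(self, doc_id, partial, if_seq_no, if_primary_term):
        """Conditional partial update, mimicking if_seq_no/if_primary_term."""
        e = self._docs[doc_id]
        if e["seq_no"] != if_seq_no or e["primary_term"] != if_primary_term:
            raise ConflictError(doc_id)
        e["source"].update(partial)
        e["seq_no"] += 1

def update_with_retry(idx, doc_id, partial, max_retries=3):
    """On conflict, re-read the latest version and retry, so the cronjob
    never overwrites a concurrent Kafka-consumer write with stale data."""
    for _ in range(max_retries):
        hit = idx.get(doc_id)  # fresh read: current _seq_no/_primary_term
        try:
            idx.update(doc_id, partial, hit["_seq_no"], hit["_primary_term"])
            return idx.get(doc_id)["_source"]
        except ConflictError:
            continue  # someone wrote in between; re-read and retry
    raise RuntimeError(f"gave up on {doc_id} after {max_retries} conflicts")
```

The key point the sketch illustrates: the hit fetched through the PIT may be stale, but the conditional update is checked against the live index, so a stale `_seq_no` fails rather than silently overwriting, and the retry re-reads the current document before reapplying the partial update.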

It would be great if the community could give me better clarity on this usage and on whether this approach is a good one.

Thanks, Harshit

Upvotes: 0

Views: 21

Answers (0)
