Reputation: 12487
So I have a situation where I receive a lot of large XML files and I want that data synchronised to Elasticsearch.
Current way
Proposed way
This means that out of 500,000 items, I only have to add the, say, 5,000 items that have changed, rather than duplicating all 500,000.
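As a rough illustration of that delta idea, here is a minimal Python sketch that hashes each item so only changed or new items would be pushed. The <item id="..."> layout and the file names are assumptions; adjust them to the actual XML schema.

```python
import hashlib
import xml.etree.ElementTree as ET

def item_hashes(xml_path):
    """Map each item's ID to an MD5 of its serialized content.

    Assumes a hypothetical <items><item id="...">...</item></items>
    layout; adjust the tag and attribute names to your schema.
    """
    hashes = {}
    for item in ET.parse(xml_path).getroot().iter("item"):
        raw = ET.tostring(item, encoding="utf-8")
        hashes[item.get("id")] = hashlib.md5(raw).hexdigest()
    return hashes

# Only items that are new or whose hash changed need indexing.
old = item_hashes("feed_yesterday.xml")   # hypothetical previous export
new = item_hashes("feed_today.xml")       # hypothetical current export
changed = {i: h for i, h in new.items() if old.get(i) != h}
```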
Question
In a scenario like this, how do I ensure they stay synchronised? For example, if Elasticsearch gets wiped, how can my program tell that it needs to add the whole lot again? Is there a way to use some sort of synchronisation key on Elasticsearch, or is there perhaps a better approach?
Upvotes: 0
Views: 63
Reputation: 53496
Here is what I recommend...
Add a stored field to your type to store a hash like MD5
Use Scan/Scroll to export the ID and Hash from ES
In your backing dataset, export the ID and hash
Use something like MapReduce to "join" the exported IDs from each set
Where the hashes differ, or a key is missing from one side, index/update the document
The hash is only useful if you want to detect document changes. This also assumes that you either persist ES's IDs back to your backing store or self-assign the IDs (see the sketch below).
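For illustration, a minimal Python sketch of that flow using the elasticsearch-py client. The index name items, the field content_hash, and the shape of the backing export are assumptions, and the in-memory dict comparison stands in for the MapReduce join when the ID/hash sets fit in RAM:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()

def es_hashes(index="items"):
    """Scan/scroll over the index, returning {doc_id: stored_hash}."""
    return {
        hit["_id"]: hit["_source"].get("content_hash")
        for hit in scan(es, index=index, _source=["content_hash"])
    }

def sync(backing, index="items"):
    """Index only docs that are new or whose hash differs from ES.

    `backing` is {doc_id: {"content_hash": ..., ...other fields...}}
    exported from your backing dataset (IDs are self-assigned). If ES
    was wiped, es_hashes() comes back empty, so every document counts
    as missing and the whole lot is re-indexed automatically -- no
    separate synchronisation key needed.
    """
    existing = es_hashes(index)
    actions = (
        {"_index": index, "_id": doc_id, "_source": doc}
        for doc_id, doc in backing.items()
        if existing.get(doc_id) != doc["content_hash"]
    )
    bulk(es, actions)
```

Note this only covers adds and updates; deletes would fall out of the same join as IDs present in ES but missing from the backing export.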
Upvotes: 1