Reputation: 12487
So I have a situation where I receive a lot of large XML files and I want that data synchronised to Elasticsearch.
Current way
Proposed way
This means that out of 500,000 items, I only have to add the, say, 5,000 items that have changed, rather than duplicating all 500,000.
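As a rough illustration of that delta idea, here is a minimal Python sketch that hashes each item so only changed or new items would be pushed. The <item id="..."> layout and the file names are assumptions; adjust them to the actual XML schema.

```python
import hashlib
import xml.etree.ElementTree as ET

def item_hashes(xml_path):
    """Map each item's ID to an MD5 of its serialized content.

    Assumes a hypothetical <items><item id="...">...</item></items>
    layout; adjust the tag and attribute names to your schema.
    """
    hashes = {}
    for item in ET.parse(xml_path).getroot().iter("item"):
        raw = ET.tostring(item, encoding="utf-8")
        hashes[item.get("id")] = hashlib.md5(raw).hexdigest()
    return hashes

# Only items that are new or whose hash changed need indexing.
old = item_hashes("feed_yesterday.xml")   # hypothetical previous export
new = item_hashes("feed_today.xml")       # hypothetical current export
changed = {i: h for i, h in new.items() if old.get(i) != h}
```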
Question
In a scenario like this, how do I ensure they stay synchronised? For example, if Elasticsearch gets wiped, how can my program tell that it needs to add the whole lot again? Is there a way to use some sort of synchronisation key on Elasticsearch, or is there perhaps a better approach?
Upvotes: 0
Views: 63
Reputation: 53496
Here is what I recommend...
Add a stored field to your type to store a hash like MD5
Use Scan/Scroll to export the ID and Hash from ES
In your backing dataset, export the ID and hash
Use something like MapReduce to "join" the exported IDs from each set
Where the hashes differ, or a key is missing from one side, index/update the document
The hash is only useful if you want to detect document changes. This also assumes that you either persist ES's IDs back to your backing store or self-assign the IDs (see the sketch below).
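For illustration, a minimal Python sketch of that flow using the elasticsearch-py client. The index name items, the field content_hash, and the shape of the backing export are assumptions, and the in-memory dict comparison stands in for the MapReduce join when the ID/hash sets fit in RAM:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch()

def es_hashes(index="items"):
    """Scan/scroll over the index, returning {doc_id: stored_hash}."""
    return {
        hit["_id"]: hit["_source"].get("content_hash")
        for hit in scan(es, index=index, _source=["content_hash"])
    }

def sync(backing, index="items"):
    """Index only docs that are new or whose hash differs from ES.

    `backing` is {doc_id: {"content_hash": ..., ...other fields...}}
    exported from your backing dataset (IDs are self-assigned). If ES
    was wiped, es_hashes() comes back empty, so every document counts
    as missing and the whole lot is re-indexed automatically -- no
    separate synchronisation key needed.
    """
    existing = es_hashes(index)
    actions = (
        {"_index": index, "_id": doc_id, "_source": doc}
        for doc_id, doc in backing.items()
        if existing.get(doc_id) != doc["content_hash"]
    )
    bulk(es, actions)
```

Note this only covers adds and updates; deletes would fall out of the same join as IDs present in ES but missing from the backing export.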
Upvotes: 1