Jimmy

Reputation: 12487

Keeping elasticsearch in sync with key or versioning

So I have a situation where I receive a lot of large XML files, and I want that data synchronised to Elasticsearch.

Current way

Proposed way

This means that out of 500,000 items, I only have to add the 5,000 items that have changed, for example, rather than re-indexing all 500,000.

Question

In a scenario like this, how do I ensure they are synchronised? For example, if Elasticsearch gets wiped, how can my program tell that it needs to add the whole lot again? Is there a way to use some sort of synchronisation key on Elasticsearch, or is there a better approach?

Upvotes: 0

Views: 63

Answers (1)

Andrew White

Reputation: 53496

Here is what I recommend...

  1. Add a stored field to your type to hold a hash, such as MD5

  2. Use scan/scroll to export the ID and hash from ES

  3. From your backing dataset, export the ID and hash

  4. Use something like MapReduce to "join" the two exported sets on ID

  5. Where the hashes differ, or a key exists in one set but not the other, index/update accordingly

The hash is only useful if you want to detect document changes. This also assumes that you either persist ES's IDs back to your backing store or assign the IDs yourself.
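The join step in the list above could be sketched as follows. This is a minimal Python example under the assumption that both sides have already been exported into plain `{id: hash}` mappings (step 2 via scan/scroll, step 3 from the backing store); the function names are illustrative, not part of any Elasticsearch API.

```python
import hashlib


def doc_hash(doc_body: str) -> str:
    """MD5 of the serialized document, to be stored alongside it in ES."""
    return hashlib.md5(doc_body.encode("utf-8")).hexdigest()


def diff_by_hash(source: dict, es: dict):
    """Join two {id: hash} mappings and classify each ID.

    Returns (to_index, to_update, to_delete):
      - to_index:  IDs present in the source but missing from ES
      - to_update: IDs present in both but whose hashes differ
      - to_delete: IDs present in ES but gone from the source
    """
    to_index = sorted(i for i in source if i not in es)
    to_update = sorted(i for i in source if i in es and source[i] != es[i])
    to_delete = sorted(i for i in es if i not in source)
    return to_index, to_update, to_delete


# Example: "a" is new, "c" changed, "d" was removed from the source.
source = {"a": "h1", "b": "h2", "c": "h3-new"}
es = {"b": "h2", "c": "h3-old", "d": "h4"}
print(diff_by_hash(source, es))  # (['a'], ['c'], ['d'])
```

Note that a wiped index is handled for free: every source ID lands in `to_index`, so the program naturally re-adds the whole lot without needing a separate synchronisation key.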

Upvotes: 1
