SebScoFr

Reputation: 901

Elasticsearch reindex API partial update

We need to reindex Elasticsearch documents from one index to another, and we're using the reindex API for that. Sometimes, though, the document already exists in the destination index. Setting version_type: "external" causes the document in the destination index to be updated, which works great, except that it performs a full update; I'd like it to perform a partial update on that document instead. Something like setting ctx.op = "partial" would be nice, but it's apparently not implemented as of today. Any alternative ideas for achieving this would be appreciated.

PS: I'd like to avoid querying the source index for every document and sending each one individually to the destination with an upsert; for performance reasons, that seems like it would be quite slow compared to the reindex API.

Upvotes: 0

Views: 2474

Answers (1)

Nikolay Vasiliev

Reputation: 6076

Disclaimer: this answer has been updated.

To achieve a partial update you may define a script.

In theory you may apply any transformation you want to the document being reindexed.
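As a rough illustration of what such a request could look like, here is a minimal sketch that builds a _reindex request body with a script attached. The index names (index_a, index_b) and the field name (internal_notes) are hypothetical, and the "lang": "painless" / "source" keys are an assumption about the Elasticsearch version in use; older releases used a different script language and key. Note that the script only sees the document coming from the source index.

```python
import json
from typing import Any, Dict


def build_reindex_body(source_index: str, dest_index: str, script_source: str) -> Dict[str, Any]:
    """Build the JSON body for a POST _reindex request that runs a script on each document."""
    return {
        "source": {"index": source_index},
        # version_type "external" is the setting mentioned in the question.
        "dest": {"index": dest_index, "version_type": "external"},
        # Assumption: Painless with the "source" key; adjust for your version.
        "script": {"lang": "painless", "source": script_source},
    }


# Hypothetical example: drop one field from every document while reindexing.
body = build_reindex_body("index_a", "index_b", "ctx._source.remove('internal_notes')")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of the _reindex call with whichever HTTP or Elasticsearch client you already use.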

(End of original answer.)


Implementing custom reindex-and-merge

As the author of the question pointed out, a script does not help if one needs to merge two documents: the one already existing in the destination index and the new one.

Elasticsearch's _reindex API was introduced in version 2.3 and was considered experimental; it looks like it was simply a combination of a scroll query with the bulk insert API. I draw this conclusion from the fact that this page in the Definitive Guide suggests reindexing your data in this way:

To reindex all of the documents from the old index efficiently, use scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index.

Now, to address the need for a partial update. The process of reindex-and-merge can be roughly divided into four stages:

  1. reading a document from index A
  2. reading the corresponding document from index B
  3. merging the two documents
  4. inserting the merged document into B
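The merge stage is the only part that is specific to your data. A minimal sketch, assuming that fields from index A should override those already in index B and that nested objects should be merged recursively (both are assumptions; the function name merge_documents and the sample fields are hypothetical):

```python
from typing import Any, Dict


def merge_documents(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new document: `existing` (from B) updated with `incoming` (from A).

    Nested dictionaries are merged key by key; any other value in
    `incoming` replaces the value in `existing`.
    """
    merged = dict(existing)  # shallow copy so the input is not mutated
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_documents(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical documents with the same _id in both indexes.
doc_in_b = {"title": "old", "tags": ["x"], "meta": {"views": 10, "lang": "en"}}
doc_in_a = {"title": "new", "meta": {"views": 11}}
merged_doc = merge_documents(doc_in_b, doc_in_a)
```

Fields that exist only in the destination document (tags and meta.lang here) survive the merge, which is the behaviour a full update would not give you.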

Stages 1 and 4 are what the original reindex call already does; what differs now is the need to join with another index and merge the documents.

I would propose writing a custom script that uses scroll to read index A in a streaming fashion, the multi-get (_mget) API to retrieve the matching documents from index B in batches, custom code to merge the documents, and the bulk API to insert them. The performance of such a script should be at least comparable to the original reindex implementation. (Also make sure you check out this page with indexing performance tuning tips; in particular, increase or disable index.refresh_interval.)

There are of course other options that are not specific to Elasticsearch and which the author of this question might have already considered (like dumping both indexes, joining them with custom code, and inserting the result into the new index).
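To make the proposal more concrete, here is a sketch of how one batch could be processed. It only builds the request bodies, so the network calls are left to whichever client you use. The helper names, the index name index_b, and the shallow "A overrides B" merge are all hypothetical, and using the multi-get (_mget) endpoint for the batched read from index B is my assumption about which API fits that step. Older Elasticsearch versions also require a _type in the bulk action line.

```python
import json
from typing import Any, Dict, Iterable, List


def build_mget_body(ids: Iterable[str]) -> Dict[str, List[str]]:
    """Body for POST /<index B>/_mget: fetch the existing copies of one scroll batch."""
    return {"ids": list(ids)}


def merge_batch(hits: List[Dict[str, Any]], existing_by_id: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge each hit from index A over its counterpart from index B, if any."""
    merged = []
    for hit in hits:
        old_source = existing_by_id.get(hit["_id"], {})
        # Shallow merge: fields from A win, fields only present in B are kept.
        merged.append({"_id": hit["_id"], "_source": {**old_source, **hit["_source"]}})
    return merged


def build_bulk_payload(dest_index: str, docs: List[Dict[str, Any]]) -> str:
    """Newline-delimited JSON for POST /_bulk: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": doc["_id"]}}))
        lines.append(json.dumps(doc["_source"]))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


# One hypothetical scroll batch from index A and what _mget returned from index B.
scroll_hits = [{"_id": "1", "_source": {"a": 1}}, {"_id": "2", "_source": {"b": 2}}]
found_in_b = {"1": {"a": 0, "c": 3}}

mget_body = build_mget_body(hit["_id"] for hit in scroll_hits)
merged_docs = merge_batch(scroll_hits, found_in_b)
bulk_payload = build_bulk_payload("index_b", merged_docs)
```

In a real script this would run once per scroll page: send mget_body to index B, build found_in_b from the documents that were found, then post bulk_payload.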

Hope this helps.

Upvotes: 1
