Reputation: 901
We need to reindex Elasticsearch documents from one index to another, and we're using the reindex API for that. Sometimes, though, the document already exists in the destination index. Setting version_type: "external" makes the reindex update the document in the destination index, which works great, except that it performs a full update. I'd like it to perform a partial update on that document instead.
Something like setting ctx.op = "partial" would be nice, but it's apparently not implemented as of today.
Any alternative ideas for achieving this would be appreciated.
PS: I'd like to avoid querying the source index for every document and sending each one individually to the destination with an upsert. For performance reasons, that seems like it would be quite slow compared to the reindex API.
Upvotes: 0
Views: 2474
Reputation: 6076
Disclaimer: this answer has been updated.
To achieve a partial update you may define a script.
In theory you may apply any transformation you want to the document being reindexed.
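As a rough illustration of the script idea, here is a minimal sketch that only builds the JSON body for a POST _reindex request. The index names, the field being removed, and the helper name build_reindex_body are hypothetical, and the "painless"/"source" script keys assume a reasonably recent Elasticsearch version (older releases used different script keys and languages).

```python
import json


def build_reindex_body(source_index, dest_index, script_source, params=None):
    """Build the JSON body for a POST _reindex request with an inline script.

    Hypothetical helper: it only assembles a dict and sends nothing.
    """
    return {
        "source": {"index": source_index},
        # version_type "external" is the setting mentioned in the question.
        "dest": {"index": dest_index, "version_type": "external"},
        "script": {
            "lang": "painless",       # assumption: a version that ships Painless
            "source": script_source,  # runs once per document being reindexed
            "params": params or {},
        },
    }


# Example transformation: drop one (hypothetical) field from every document.
body = build_reindex_body(
    "old_index",
    "new_index",
    "ctx._source.remove(params.field)",
    {"field": "obsolete_field"},
)
print(json.dumps(body, indent=2))
```

Note that the script only sees the document coming from the source index, so it can reshape that document but has no access to the copy already stored in the destination.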
(End of original answer.)
As the author of the question pointed out, it does not help if one needs to merge two documents, the one already existing in the resulting index and a new one.
The Elasticsearch _reindex method was introduced in version 2.3 and was considered experimental; it looks like it was simply a combination of a scroll query with the bulk insert API. I draw this conclusion from the fact that this page in the Definitive Guide suggests reindexing your data in this way:
To reindex all of the documents from the old index efficiently, use scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index.
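Now, to address the need for a partial update. The process of reindex-and-merge can be roughly divided into four stages:
1. Read the documents from the source index (A).
2. Retrieve the matching documents from the destination index (B).
3. Merge each pair of documents.
4. Write the merged documents to the destination index.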
Stages 1 and 4 are what the original reindex call already does; what makes it different now is the need to join with another index and merge the documents.
I would propose writing a custom script that uses scroll to read index A in a streaming fashion, a batched lookup (the multi-get API) to retrieve the matching documents from index B, custom code to merge the documents, and the bulk API to insert them. The performance of such a script should be at least comparable to the original reindex implementation. (Also make sure you check out this page with index performance tuning tips; in particular, increase or disable index.refresh_interval.)
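The merge step of such a script could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested implementation: the names batched, merge_docs, and merged_actions are hypothetical, the merge is a shallow one where fields from index A overwrite those already in index B, and all Elasticsearch I/O is replaced by plain Python iterables and a callback so the sketch runs without a cluster.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_docs(existing, incoming):
    """Shallow merge: fields of `incoming` overwrite those of `existing`."""
    merged = dict(existing or {})
    merged.update(incoming)
    return merged


def merged_actions(source_hits, fetch_existing, dest_index, batch_size=500):
    """Turn hits read from index A into index actions for index B.

    `source_hits` is any iterable of {"_id": ..., "_source": {...}} dicts
    (for example the output of a scroll). `fetch_existing` is a callback
    taking a list of ids and returning {id: source} for the documents that
    already exist in the destination (for example via a multi-get request).
    """
    for batch in batched(source_hits, batch_size):
        existing = fetch_existing([hit["_id"] for hit in batch])
        for hit in batch:
            yield {
                "_op_type": "index",
                "_index": dest_index,
                "_id": hit["_id"],
                "_source": merge_docs(existing.get(hit["_id"]), hit["_source"]),
            }


# In-memory stand-ins for the two indexes (made-up data).
index_b = {"1": {"title": "old", "tags": ["a"]}}
hits_a = [
    {"_id": "1", "_source": {"title": "new"}},
    {"_id": "2", "_source": {"title": "fresh"}},
]
actions = list(
    merged_actions(
        hits_a,
        lambda ids: {i: index_b[i] for i in ids if i in index_b},
        "index_b",
    )
)
for action in actions:
    print(action)
```

With the official Python client, source_hits could plausibly come from elasticsearch.helpers.scan, fetch_existing could wrap the client's mget call, and the generated actions could be passed to elasticsearch.helpers.bulk; that wiring is left out here and would need checking against the client version in use.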
There are, of course, other options that are not specific to Elasticsearch and that the author of this question might have already considered (like dumping both indexes, joining them with custom code, and inserting the result into the new index).
Hope this helps.
Upvotes: 1