shdu12

Reputation: 113

Elasticsearch reindex API - not able to copy all the documents

I have set up the destination index new_dest_index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.

I ran the POST command below to copy all the documents from source_index to new_dest_index, but it appears to run in the background and copies only some of the documents, not all the data in source_index.

Can someone please help? Also, is there a better way to copy documents from one index to another?

POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "new_dest_index"
  }
}

Upvotes: 3

Views: 3582

Answers (2)

Tuyen Luong

Reputation: 1366

Kevin has already covered the case where the reindex task has not finished yet, so I will cover the case where the reindex process has finished.

Note that the _reindex API can cause data inconsistency: changes to source_index (newly inserted or updated documents) made right after _reindex is triggered are not applied to new_dest_index.

For example, before you run _reindex, you add a document:

PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version1"
}
// response
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": true
}

Then you trigger the _reindex API and, right after triggering it, you update the document:

PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version2"
}
// response
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": false
}

But after _reindex has finished, you check the document in new_dest_index and it is still at the old version:

{
  "_index": "new_dest_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "found": true,
  "_source": {
    "id": 3,
    "searchable_name": "version1"
  }
}

The same problem can happen for documents inserted after _reindex is triggered. One solution is to run the first reindex while preserving the document versions from source_index, using the version_type: external setting on the destination (new_dest_index). After you have switched your write traffic to new_dest_index, you can reindex again from source_index to new_dest_index to pick up the inserts and updates that were missed after the first _reindex was triggered. You can check these settings in the _reindex documentation.
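The two-pass approach described above could be sketched like this. It is only a minimal sketch under a few assumptions: the helper name build_external_reindex_body is made up for illustration, and it only builds the request body for POST _reindex; sending it is left to whatever client you use. The version_type and conflicts options themselves are documented _reindex settings.

```python
import json


def build_external_reindex_body(source_index, dest_index):
    """Build a _reindex body that preserves source versions (hypothetical helper).

    With version_type "external", a document is written to the destination
    only if it is missing there or the destination copy has an older version,
    so a second pass picks up changes made after the first pass started.
    """
    return {
        # Count version conflicts (documents already up to date) instead of aborting.
        "conflicts": "proceed",
        "source": {"index": source_index},
        "dest": {"index": dest_index, "version_type": "external"},
    }


body = build_external_reindex_body("source_index", "new_dest_index")
print(json.dumps(body, indent=2))  # use this as the body of POST _reindex
```

The same body can be used for both passes. Keep in mind that _reindex only copies documents, so documents deleted from source_index between the passes are not removed from new_dest_index.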

Upvotes: 1

Kevin Quinzel

Reputation: 1428

I think this is the best way to copy from one index to another.

The reindex process, if I remember correctly, copies documents from one index to another in batches (1,000 documents per batch by default). You are not seeing all the documents in the destination index because the task hasn't finished yet (in the best case).

You can always list the running tasks, including reindex tasks, with _cat/tasks:

GET _cat/tasks?v

If you see a reindex task in the output, it hasn't finished and you have to wait a little longer. These processes can take minutes or even hours, depending on the number of documents to copy.

However, if you don't see it listed and the number of documents in the destination index does not match the number in the source index, the reindex process failed and has to be run again.
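The two checks above (is a reindex task still running, and do the document counts match) could be automated roughly as follows. This sketch assumes you have already fetched the JSON responses of GET _tasks?detailed and GET <index>/_count with your own client; the function names and the sample responses are invented for illustration and trimmed to the fields used here.

```python
REINDEX_ACTION = "indices:data/write/reindex"


def running_reindex_tasks(tasks_response):
    """Return the ids of reindex tasks found in a GET _tasks response."""
    running = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("action") == REINDEX_ACTION:
                running.append(task_id)
    return running


def missing_documents(source_count_response, dest_count_response):
    """Compare two GET <index>/_count responses; 0 means the counts match."""
    return source_count_response["count"] - dest_count_response["count"]


# Hand-written sample responses, not real cluster output.
sample_tasks = {
    "nodes": {
        "node-1": {
            "tasks": {
                "node-1:101": {"action": "indices:data/write/reindex"},
                "node-1:102": {"action": "cluster:monitor/tasks/lists"},
            }
        }
    }
}
print(running_reindex_tasks(sample_tasks))                   # ['node-1:101']
print(missing_documents({"count": 5000}, {"count": 3200}))   # 1800
```

If the first function returns an empty list while the second still returns a non-zero number, you are in the "failed and has to be run again" case.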

That last scenario is a bummer when you want to copy all the documents without restrictions.

A way to avoid that last scenario is to reindex with queries. You can, for instance, run one reindex task for all the documents from January to March, another for the documents from April to June, and so on.

You can run several reindex tasks at once as long as their queries don't overlap. Be careful, though: running too many tasks can affect the performance or health of your cluster.
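As a rough sketch of the query-based split, the helper below builds one _reindex request body per date slice. It assumes your documents have a date field to slice on; the field name created_at, the 2019 dates, and the function name are made up for illustration. Each slice uses gte on its start and lt on its end, so neighbouring slices never overlap.

```python
from datetime import date


def build_range_reindex_bodies(source_index, dest_index, date_field, boundaries):
    """Build one _reindex body per consecutive pair of boundary dates.

    Each slice covers [start, end): "gte" on the start and "lt" on the end,
    so no document matches two slices.
    """
    bodies = []
    for start, end in zip(boundaries, boundaries[1:]):
        bodies.append({
            "source": {
                "index": source_index,
                "query": {
                    "range": {
                        date_field: {
                            "gte": start.isoformat(),
                            "lt": end.isoformat(),
                        }
                    }
                },
            },
            "dest": {"index": dest_index},
        })
    return bodies


# "created_at" and these dates are placeholder values, not from the question.
quarters = [date(2019, 1, 1), date(2019, 4, 1), date(2019, 7, 1)]
bodies = build_range_reindex_bodies(
    "source_index", "new_dest_index", "created_at", quarters
)
for body in bodies:
    print(body["source"]["query"]["range"]["created_at"])
```

Each body is then sent as its own POST _reindex request. Documents that fall outside the first and last boundary, or that lack the date field, are not matched by any slice, so check the totals afterwards.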

Hope this is helpful! :)

Upvotes: 2
