AshwinK

Reputation: 1088

How does Elasticsearch delete_by_query work? What happens when we insert new data and retrieve it while documents are being deleted?

I wanted to know more about Elasticsearch delete, its Java high-level delete API & whether it's feasible to perform a bulk delete.

Following is the config information:

In my case around 10K records are added daily into the index dev-answer. I want to trigger a delete operation (this can be triggered daily, once a week, or once a month) which will basically delete all documents from the above index if a specific condition is satisfied (which I'll give in the DeleteByQueryRequest).

For delete there is an API as given in the latest docs, which I'm referring to.

DeleteByQueryRequest request = new DeleteByQueryRequest("source1", "source2");
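For reference, the REST equivalent of the Java request above could look something like the following (the index names source1 and source2 come from the snippet; the match query is a purely hypothetical example of a delete condition):

```json
POST /source1,source2/_delete_by_query
{
  "query": {
    "match": { "some_field": "some_value" }
  }
}
```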

While reading the documentation I came across following queries which I'm unable to understand.

  1. As in the doc: It’s also possible to limit the number of processed documents by setting size. request.setSize(10); What does processed documents mean? Will it delete only 10 documents?

  2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

    Should I first make a call to get the number of documents & change setBatchSize based on that?

  3. request.setSlices(2); Should slices depend on how many cores the executor machine has, or on the number of cores in the Elasticsearch cluster?

  4. In the documentation the method setSlices(2) is given, which I'm unable to find in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

  5. Let's consider that I'm executing this delete query in async mode, which takes 0.5-1.0 sec; meanwhile, if I'm doing a GET request on this index, will it give some exception? Also, if I insert a new document at the same time & retrieve it, will it be able to give a response?

Upvotes: 7

Views: 5373

Answers (1)

Pierre-Nicolas Mougel

Reputation: 2279

1. As in the doc: It’s also possible to limit the number of processed documents by setting size. request.setSize(10); What does processed documents mean? Will it delete only 10 documents?

If you have not already, you should read the search/_scroll documentation. _delete_by_query performs a scroll search using the query given as a parameter.

The size parameter corresponds to the number of documents returned by each call to the scroll endpoint. If you have 10 documents matching your query and a size of 2, Elasticsearch will internally perform 5 search/_scroll calls (i.e., 5 batches), whereas if you set a size of 5, only 2 search/_scroll calls will be performed.

Regardless of the size parameter, all documents matching the query will be removed, but the operation will be more or less efficient.
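On the REST side, the batch size is controlled by the scroll_size URL parameter (the index name dev-answer comes from the question; the query is a hypothetical example). With the 10 matching documents from the example above, scroll_size=2 would mean 5 internal scroll batches, while scroll_size=5 would mean 2:

```json
POST /dev-answer/_delete_by_query?scroll_size=2
{
  "query": {
    "match": { "some_field": "some_value" }
  }
}
```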

2. What batch size should I set? request.setBatchSize(100); Does its performance depend on how many documents we are going to delete?

The setBatchSize() method is equivalent to setting the size parameter in the query. You can read this article to determine the correct value for the size parameter.

3. Should I first make a call to get the number of documents & change setBatchSize based on that?

You would have to run the search request twice to get the number of documents to delete; I believe that would not be efficient. I advise you to find a constant value and stick to it.
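If you do want the number of matching documents beforehand, the _count endpoint is cheaper than a full search (index name from the question, query hypothetical):

```json
GET /dev-answer/_count
{
  "query": {
    "match": { "some_field": "some_value" }
  }
}
```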

4. Should slices depend on how many cores the executor machine has, or on the number of cores in the Elasticsearch cluster?

The number of slices should be set from the Elasticsearch cluster configuration. Slicing allows parallelizing the search both across shards and within each shard.

You can read the documentation for hints on how to set this parameter. Usually it is the number of shards of your index.
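On the REST side this is the slices URL parameter; in recent Elasticsearch versions you can also pass slices=auto to let Elasticsearch pick a sensible value (typically one slice per shard). The index name is from the question, the query hypothetical:

```json
POST /dev-answer/_delete_by_query?slices=auto
{
  "query": {
    "match": { "some_field": "some_value" }
  }
}
```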

5. In the documentation the method setSlices(2) is given, which I'm unable to find in the class org.elasticsearch.index.reindex.DeleteByQueryRequest. What am I missing here?

You are right, that is probably an error in the documentation. I have never tried it, but I believe you should use forSlice(TaskId slicingTask, SearchRequest slice, int totalSlices).

6. Let's consider that I'm executing this delete query in async mode, which takes 0.5-1.0 sec; meanwhile, if I'm doing a GET request on this index, will it give some exception? Also, if I insert a new document at the same time & retrieve it, will it be able to give a response?

First, as stated in the documentation, the _delete_by_query endpoint creates a snapshot of the index and works on this copy.

For a GET request, it depends on whether the document has already been deleted or not. It will never throw an exception; you will just get the same result as if you were retrieving an existing or a non-existing document. Please note that unless you specify a sort in the search query, the order in which documents are deleted is not determined.

If you insert (or update) a document during processing, it will not be taken into account by the _delete_by_query endpoint, even if it matches the _delete_by_query query. This is where the snapshot is used. So if you insert a new document, you will be able to retrieve it. The same goes for updating an existing document: if it has already been deleted, it will be created again; if it has not been deleted yet, it will be updated but not deleted.

As a side note, deleted documents will still be searchable (even after the delete_by_query task has finished) until a refresh operation occurs.

_delete_by_query does not support the refresh parameter. The "request return" mentioned in the documentation for the refresh operation refers to requests that can have a refresh parameter. If you want to force a refresh you can use the _refresh endpoint. By default, a refresh occurs every second, so at most 1 second after the _delete_by_query operation is finished, the deleted documents will no longer be searchable.
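For example, to force a refresh on the index right after the delete finishes (index name taken from the question):

```json
POST /dev-answer/_refresh
```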

Upvotes: 5

Related Questions