mrmarbles

Reputation: 49

Elasticsearch java API bulk delete not working

I'm attempting to perform a bulk delete of documents whose IDs are derived from a previous search. The query that determines which documents are candidates for deletion produces the desired results (thousands of records); however, the bulk delete only deletes 10 records at a time, even though I'm feeding it all of the results of the original query:

Client client = node.client();
BulkRequestBuilder bulkRequest = client.prepareBulk();

SearchResponse deletes = client.prepareSearch("my_index")
        .setTypes("my_doc_type")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(boolQuery().mustNot(termQuery("tId", transactionId)))
        .execute()
        .actionGet();

long deleteHits = deletes.getHits().getTotalHits();

if (deleteHits > 0) {

    logger.info("Preparing to delete (" + deleteHits + ") " +
            "documents from index");

    Arrays.asList(deletes.getHits().getHits()).stream().forEach(h ->
            bulkRequest.add(client.prepareDelete()
                .setIndex("my_index")
                .setType("my_doc_type")
                .setId(h.getId())));

    BulkResponse bulkResponse = bulkRequest.execute().actionGet();

    if (bulkResponse.hasFailures()) {
        throw new RuntimeException(bulkResponse.buildFailureMessage());
    }

}

Upvotes: 1

Views: 3971

Answers (2)

imotov

Reputation: 30163

By default, a search response returns only the top 10 results. So, while deletes.getHits().getTotalHits() can be in the thousands or even millions, the size of deletes.getHits().getHits() will never be more than what you specified in the size parameter of your request, which is 10 by default.

A naive approach would be to paginate through the results with a normal search by changing the from parameter. However, this can cause some records to be missed: each request executes a new search, and because records from the previous page have been deleted, the results of the next search can shift relative to the previous one.
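To make the shifting effect concrete, here is a small self-contained simulation. It does not use the Elasticsearch client at all: the class and method names are made up for illustration, the "index" is just an in-memory list, and it assumes deletes are visible to the next search immediately (in a real cluster that depends on refresh timing). It is only a sketch of why advancing from while deleting can skip records, and why iterating over a fixed view of the results does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; an in-memory list stands in for the index.
class PagingDeleteDemo {

    // Build a fake index containing document ids 0..totalDocs-1.
    public static List<Integer> newIndex(int totalDocs) {
        List<Integer> index = new ArrayList<>();
        for (int id = 0; id < totalDocs; id++) {
            index.add(id);
        }
        return index;
    }

    // Mimic a search with from/size: return a copy of one page of ids.
    public static List<Integer> searchPage(List<Integer> docs, int from, int size) {
        if (from >= docs.size()) {
            return new ArrayList<>();
        }
        int to = Math.min(from + size, docs.size());
        return new ArrayList<>(docs.subList(from, to));
    }

    // Naive approach: re-run the search against the live index and advance
    // `from` after each page is deleted. Returns the number of docs left behind.
    public static int deleteWithFromPaging(int totalDocs, int pageSize) {
        List<Integer> index = newIndex(totalDocs);
        int from = 0;
        while (true) {
            List<Integer> page = searchPage(index, from, pageSize);
            if (page.isEmpty()) {
                break;
            }
            index.removeAll(page); // remaining docs shift toward offset 0
            from += pageSize;      // so this offset now skips over them
        }
        return index.size();
    }

    // Consistent-view approach: page over a snapshot taken before any delete,
    // which is roughly the guarantee a scroll provides. Returns docs left behind.
    public static int deleteWithSnapshot(int totalDocs, int pageSize) {
        List<Integer> index = newIndex(totalDocs);
        List<Integer> snapshot = new ArrayList<>(index);
        for (int from = 0; from < snapshot.size(); from += pageSize) {
            index.removeAll(searchPage(snapshot, from, pageSize));
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println("from/size paging left "
                + deleteWithFromPaging(25, 10) + " documents");
        System.out.println("snapshot paging left "
                + deleteWithSnapshot(25, 10) + " documents");
    }
}
```

With 25 documents and a page size of 10, the first variant deletes ids 0-9, then reads offset 10 of the 15 remaining documents (ids 20-24), and stops with ids 10-19 still in the index. The second variant removes everything because the page boundaries never move.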

The proper approach is to use the specialized scan and scroll search to paginate through the records. This type of search keeps the results consistent between calls. An example of this approach can be found in the delete-by-query plugin that will be available in v2.0.

I should also note that although delete-by-query functionality exists in earlier versions of Elasticsearch and might seem like the easiest solution to your problem, I would still recommend scan/scroll, because the existing delete-by-query API implementation in pre-v2.0 releases performs poorly and is fragile.

Upvotes: 2

maximede

Reputation: 1823

deletes.getHits().getTotalHits() gives you the total number of hits for the search, but the SearchResponse deletes does not contain all of the results. You'll need to paginate over them.

You'll need to use something like this to define the paging:

client.prepareSearch("my_index").setFrom(from).setSize(pageSize); // from: offset of the first hit, pageSize: hits per page

Upvotes: 0
