Reputation: 11155
I want to delete all duplicated documents in an index. I started by trying to detect duplicated items with the following query.
Elasticsearch crashed with java.lang.OutOfMemoryError: Java heap space. The heap size is 32 GB and the index holds about 500,000,000 documents.
How do I find the duplicated values, grouped by item.category and item.profile (both defined as keyword)? And how do I delete those duplicated items?
GET /requests/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "item.text",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
Thanks
Upvotes: 1
Views: 894
Reputation: 2089
I would write an external script that queries the document IDs with pagination and launches several delete_by_query requests.
You can use `include.partition` on your terms aggregation, as described in the article below, to paginate your aggregations: https://spoon-elastic.com/all-elastic-search-post/pagination-aggregation-elasticsearch/
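A minimal sketch of that loop in Python, building the request bodies only (the function names, the partition count, and the batch sizes are my assumptions; the `include.partition` and `ids` query shapes are the standard Elasticsearch ones). Partitioning splits the field's value space into chunks small enough that each aggregation fits in the heap:

```python
def partition_agg(field, partition, num_partitions, size=10000):
    """Terms aggregation body covering one partition of the field's values."""
    return {
        "size": 0,
        "aggs": {
            "duplicateCount": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,      # only values that occur at least twice
                    "size": size,            # buckets returned per partition
                    "include": {             # paginate: this call sees 1/num_partitions of the values
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                },
                "aggs": {
                    "duplicateDocuments": {
                        # collect the _id of each duplicate; skip the _source payload
                        "top_hits": {"size": 100, "_source": False}
                    }
                },
            }
        },
    }

def ids_to_delete(response):
    """From one partition's response, keep every _id except the first per value."""
    ids = []
    for bucket in response["aggregations"]["duplicateCount"]["buckets"]:
        hits = bucket["duplicateDocuments"]["hits"]["hits"]
        ids.extend(h["_id"] for h in hits[1:])  # preserve one copy of each value
    return ids

def delete_by_ids_body(ids):
    """Body for POST /requests/_delete_by_query targeting the collected _ids."""
    return {"query": {"ids": {"values": list(ids)}}}
```

The external script would then loop `partition` from 0 to `num_partitions - 1`, send each `partition_agg` body to `_search`, and feed the collected IDs to `_delete_by_query` in batches.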
If you have enough space, you can also reindex into an external index through an ingest pipeline, for example one that derives each document's `_id` from the key fields so that duplicates overwrite each other and only one copy per key survives.
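The reindex-through-pipeline idea relies on a deterministic `_id`: identical key fields map to the same `_id`, so indexing a duplicate just overwrites the existing copy. A minimal Python sketch of such a fingerprint (the field choice of `item.category` and `item.profile` and the separator are assumptions based on the question):

```python
import hashlib

def dedup_id(category: str, profile: str) -> str:
    """Deterministic _id from the duplicate-key fields.

    Identical (category, profile) pairs hash to the same _id, so a
    reindex that assigns _id this way keeps a single document per pair.
    The \x1f separator avoids collisions like ("ab", "c") vs ("a", "bc").
    """
    key = f"{category}\x1f{profile}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()
```

The same effect can be achieved inside Elasticsearch itself with an ingest pipeline (e.g. a fingerprint processor writing into `_id`) so no client-side code runs per document.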
Upvotes: 1