SexyMF

Reputation: 11155

Elasticsearch: find duplicate documents by field value

I want to delete all duplicated documents in an index. I started by trying to detect the duplicated items with the following query, but Elasticsearch crashed with java.lang.OutOfMemoryError: Java heap space. The heap size is 32GB and the index holds 500,000,000 docs.

  1. How do I find the duplicated values, grouped by item.category and item.profile (both defined as keyword)?

  2. How do I delete those duplicated items?

GET /requests/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "item.text",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDecuments": {
          "top_hits": {
           
          }
        }
      }
    }
  }
}

Thanks

Upvotes: 1

Views: 894

Answers (1)

Jaycreation

Reputation: 2089

I would write an external script that queries the duplicate document ids with pagination, then launches several _delete_by_query requests.

You can use the "partition" option on your terms aggregation, as described in the article below, to paginate your aggregations: https://spoon-elastic.com/all-elastic-search-post/pagination-aggregation-elasticsearch/
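
For example, a partitioned version of your aggregation could look like the sketch below. The field item.category, the num_partitions value of 20, and the sizes are assumptions; you would run the request once for each partition, 0 through 19, collecting the duplicate keys and hit ids as you go:

GET /requests/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "item.category",
        "size": 10000,
        "min_doc_count": 2,
        "include": {
          "partition": 0,
          "num_partitions": 20
        }
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 3,
            "_source": false
          }
        }
      }
    }
  }
}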

If you have enough space, you can also use an external index with a reindex. With an ingest pipeline, you can (a sketch follows the list):

  1. construct a unique key (item.category + "-" + item.profile, ...) and use it as _id
  2. reindex into this index: it will remove the duplicates
  3. use this new index, or empty the old index and reindex back into it.
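
A minimal sketch of that approach, assuming the source index is named requests; the pipeline name dedup-key and the destination index requests-dedup are placeholders:

PUT _ingest/pipeline/dedup-key
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{item.category}}-{{item.profile}}"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "requests"
  },
  "dest": {
    "index": "requests-dedup",
    "pipeline": "dedup-key"
  }
}

Because reindex writes with internal versioning by default, documents that produce the same _id overwrite each other, so only one document per key survives.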

Another solution is to:

  1. construct a unique key (item.category + "-" + item.profile, ...) and use it as _id
  2. reindex into this index, keeping only a unique identifier such as _source.id
  3. use all those ids in an external script that issues a _delete_by_query to remove all documents with those ids (a sketch follows).
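
A minimal sketch of such a _delete_by_query call, with placeholder ids:

POST /requests/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["id-1", "id-2", "id-3"]
    }
  }
}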


Upvotes: 1
