SexyMF

Reputation: 11155

Elasticsearch: find duplicate documents by field value

I want to delete all duplicated documents in an index. I started by trying to detect the duplicated items with the following query, but Elasticsearch crashed with java.lang.OutOfMemoryError: Java heap space. The heap size is 32GB and the index holds 500,000,000 docs.

  1. How do I find the duplicated values, grouped by item.category and item.profile (both defined as keyword)?

  2. How do I delete those duplicated items?

GET /requests/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "item.text",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDecuments": {
          "top_hits": {
           
          }
        }
      }
    }
  }
}

Thanks

Upvotes: 1

Views: 894

Answers (1)

Jaycreation

Reputation: 2089

I would write an external script that queries the duplicate document ids with pagination, then launches several _delete_by_query requests.

You can use the "partition" option on your terms aggregation, as described in the article below, to paginate your aggregations: https://spoon-elastic.com/all-elastic-search-post/pagination-aggregation-elasticsearch/
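
For example, a partitioned version of your aggregation could look like the sketch below. The field item.category, the num_partitions value of 20, and the sizes are assumptions; you would run the request once for each partition, 0 through 19, collecting the duplicate keys and hit ids as you go:

GET /requests/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "item.category",
        "size": 10000,
        "min_doc_count": 2,
        "include": {
          "partition": 0,
          "num_partitions": 20
        }
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 3,
            "_source": false
          }
        }
      }
    }
  }
}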

If you have enough space, you can also use an external index with a reindex. With an ingest pipeline, you can (a sketch follows the list):

  1. construct a unique key (item.category + "-" + item.profile, ...) and use it as _id
  2. reindex into this index: it will remove the duplicates
  3. use this new index, or empty the old index and reindex back into it.
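
A minimal sketch of that approach, assuming the source index is named requests; the pipeline name dedup-key and the destination index requests-dedup are placeholders:

PUT _ingest/pipeline/dedup-key
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{item.category}}-{{item.profile}}"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "requests"
  },
  "dest": {
    "index": "requests-dedup",
    "pipeline": "dedup-key"
  }
}

Because reindex writes with internal versioning by default, documents that produce the same _id overwrite each other, so only one document per key survives.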

Another solution is to:

  1. construct a unique key (item.category + "-" + item.profile, ...) and use it as _id
  2. reindex into this index, keeping only a unique identifier such as _source.id
  3. use all those ids in an external script that issues a _delete_by_query to remove all documents with those ids (a sketch follows).
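
A minimal sketch of such a _delete_by_query call, with placeholder ids:

POST /requests/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["id-1", "id-2", "id-3"]
    }
  }
}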


Upvotes: 1
