John Sydnor

Reputation: 21

Elasticsearch aggregation (duplicate) search not returning all duplicates

I am searching for, and counting, duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with them, one phrase per ES document.

I have 707 documents in my index, and I know that at least 21 of them are duplicates. My search returns only 19 duplicate docs, and I don't understand why some matches are missing. Here is my query:

{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "field": "content",
                "min_doc_count": 2
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {}
                }
            }
        }
    }
}

My process:

  1. Create index
  2. Build bulk insert data objects
  3. Bulk insert documents into index
  4. Reindex documents
  5. Run duplicates query (above)
  6. Parse results: sum the doc_count of every bucket
  7. Delete index
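
Step 6 of the process above can be sketched in Python as follows. This is a minimal illustration, assuming the search response has already been parsed into a dict; the function name `count_duplicates` and the sample response are hypothetical and not taken from the real index.

```python
def count_duplicates(response, agg_name="duplicateCount"):
    """Sum doc_count over every bucket of a terms aggregation response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sum(bucket["doc_count"] for bucket in buckets)


# Hypothetical response fragment, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {"key": "hash-a", "doc_count": 3},
                {"key": "hash-b", "doc_count": 2},
            ]
        }
    }
}

total = count_duplicates(sample_response)
```

Note that this only counts the buckets that were actually returned, so any bucket Elasticsearch leaves out of the response is silently missing from the total.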

NOTE: Since Elasticsearch matches on words rather than whole phrases/sentences, I MD5-hash each phrase/sentence before inserting it into my index.
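
The hashing step in the note above might look like this in Python. The function name `phrase_key` and the optional whitespace/case normalization are assumptions added for illustration, not part of the process described; without some normalization, two phrases that differ only in spacing or capitalization produce different hashes and are not counted as duplicates.

```python
import hashlib


def phrase_key(phrase, normalize=True):
    """Return the MD5 hex digest used as the indexed value for a phrase."""
    if normalize:
        # Assumption: collapse runs of whitespace and ignore case.
        phrase = " ".join(phrase.split()).lower()
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


key_a = phrase_key("The quick  brown fox.")
key_b = phrase_key("the quick brown fox.")
```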

More detail can be provided (I didn't want my post to be too massive).

Why is ES not returning all duplicates?

Thanks

UPDATE: When creating my index I set the number of shards to 1. This returned a few more duplicates, but still not all of them.

Upvotes: 1

Views: 380

Answers (1)

Amar Tari

Reputation: 1

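A terms aggregation returns only the top 10 buckets by default. If you know roughly how many distinct values to expect, set the size parameter of the terms aggregation to at least that number, as below (replace the field name with your own):

{
    "aggs": {
        "productId": {
            "terms": {
                "field": "productId",
                "min_doc_count": 2,
                "size": 1000
            }
        }
    }
}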

Please check whether this fixes your problem.
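
As a rough sketch, the same idea applied to the query from the question could be built in Python like this. The helper names `build_duplicates_query` and `may_be_truncated` are hypothetical, and the truncation check is only a heuristic: if the number of buckets returned equals the requested size, more buckets may exist and the size should probably be raised.

```python
def build_duplicates_query(field, size):
    """Build a duplicates query with an explicit bucket limit."""
    return {
        "size": 0,  # no search hits needed, only the aggregation
        "aggs": {
            "duplicateCount": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


def may_be_truncated(response, size, agg_name="duplicateCount"):
    """Heuristic: a full bucket list suggests more buckets were cut off."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return len(buckets) >= size


query = build_duplicates_query("content", 1000)

# Hypothetical response with two duplicate buckets.
sample = {"aggregations": {"duplicateCount": {"buckets": [
    {"key": "hash-a", "doc_count": 2},
    {"key": "hash-b", "doc_count": 2},
]}}}
```

With the sample above, `may_be_truncated(sample, 2)` is true and `may_be_truncated(sample, 1000)` is false.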

Upvotes: 0
