Reputation: 21
I am searching for and counting duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one per ES document.
I have 707 documents in my index. I know that I should have at least 21 duplicate documents, but my search is returning only 19. I don't understand why I am missing some matches. Here is my query:
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "content",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
My process:
NOTE: Since Elasticsearch will match words, not phrases/sentences, I MD5-hash each phrase/sentence before inserting it into my index.
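For illustration, the hashing step could look roughly like the Python sketch below. This is only a guess at the process, not my exact code: the helper names `phrase_key` and `to_es_docs` and the extra `text` field are hypothetical, and the whitespace/lowercase normalization is an assumption. Without some normalization, two phrases that differ only in spacing or case would get different hashes and never be counted as duplicates.

```python
import hashlib


def phrase_key(phrase: str) -> str:
    """Return the MD5 hex digest used as the exact-match key for a phrase."""
    # Assumption: collapse runs of whitespace and lowercase the text so that
    # trivially different copies of the same phrase produce the same hash.
    normalized = " ".join(phrase.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def to_es_docs(phrases):
    """Build one dict per phrase, ready to be indexed as one ES document each."""
    # "content" is the field the aggregation runs on; "text" is a hypothetical
    # extra field that keeps the original phrase for display.
    return [{"content": phrase_key(p), "text": p} for p in phrases]


docs = to_es_docs([
    "The quick brown fox.",
    "the  quick brown fox.",  # same phrase, different spacing and case
    "Something else.",
])
```

Here the first two entries end up with the same `content` value and the third gets a different one.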
More detail can be provided (I didn't want my post to be too massive).
Why is ES not returning all duplicates?
Thanks
UPDATE: When creating my index I set the number of shards to 1, and this helped return a few more duplicates, but still not all.
Upvotes: 1
Views: 380
Reputation: 1
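By default the terms aggregation returns only the top 10 buckets, so any duplicates beyond those are left out of the response. If you know roughly how many distinct values to expect, set "size" to at least that number, like below:
"aggs": {
  "productId": {
    "terms": {
      "field": "productId",
      "min_doc_count": 2,
      "size": 1000
    }
  }
}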
Please check if this will fix your problem.
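Applied to the query from the question, it might look like the Python sketch below. It only builds the request body; the function name `build_duplicate_query` and the parameter `max_buckets` are made up for the example, and 1000 is just a guess at an upper bound on the number of duplicated phrases.

```python
import json


def build_duplicate_query(field: str, max_buckets: int) -> dict:
    """Build a search body that buckets documents by `field` and keeps
    only the buckets containing at least two documents."""
    return {
        "size": 0,  # no plain search hits, only the aggregation result
        "aggs": {
            "duplicateCount": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,
                    # Raise the bucket limit above the default of 10.
                    "size": max_buckets,
                },
                "aggs": {"duplicateDocuments": {"top_hits": {}}},
            }
        },
    }


body = build_duplicate_query("content", 1000)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against the index, using whichever client you already have.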
Upvotes: 0