Reputation: 3427
ElasticSearch 6.4: given an index whose documents have a field called CaptureId and a field called SourceId, we need to search for duplicate records by CaptureId value. The SourceId field can have many records with the same value, and we want to return only one SourceId per set of duplicates found. So the output would be a list of SourceIds (each listed only once) which contain any number of duplicate CaptureId values.
How would I create this query in ElasticSearch?
Here is the document mapping:
"mappings": {
  "fla_doc": {
    "_field_names": {
      "enabled": false
    },
    "properties": {
      "captureId": {
        "type": "long"
      },
      "capturedDateTime": {
        "type": "date"
      },
      "language": {
        "type": "text"
      },
      "sourceId": {
        "type": "long"
      },
      "sourceListType": {
        "type": "text"
      },
      "region": {
        "type": "text"
      }
    }
  }
}
Upvotes: 1
Views: 5002
Reputation: 16895
Assuming both of these ID fields are of a data type that supports terms aggregations (keyword, or long as in your mapping), you could do the following:
GET index_name/_search
{
  "size": 0,
  "aggs": {
    "by_duplicate_capture": {
      "terms": {
        "field": "captureId",
        "min_doc_count": 2
      },
      "aggs": {
        "by_underlying_source_ids": {
          "terms": {
            "field": "sourceId",
            "size": 1
          }
        }
      }
    }
  }
}
In case you're interested in more than one SourceId per set of duplicates, increase the size param of the inner terms aggregation.
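The aggregation returns one bucket per duplicated capture ID, so the same source ID can show up under several buckets. To get each source ID listed only once, as the question asks, the buckets can be de-duplicated on the client. Here is a minimal Python sketch of that step. It assumes the field names from the mapping (captureId, sourceId); the helper names, the max_capture_buckets default, and the sample response are hypothetical and only for illustration. Note that the outer terms aggregation returns just 10 buckets unless its size is raised.

```python
from typing import Any, Dict, List


def build_duplicate_query(max_capture_buckets: int = 1000,
                          sources_per_capture: int = 1) -> Dict[str, Any]:
    """Build a search body: captureId values seen at least twice,
    each with its most frequent sourceId value(s)."""
    return {
        "size": 0,  # no hits needed, only the aggregations
        "aggs": {
            "by_duplicate_capture": {
                "terms": {
                    "field": "captureId",
                    "min_doc_count": 2,
                    # Assumed limit; the terms default is 10 buckets.
                    "size": max_capture_buckets,
                },
                "aggs": {
                    "by_underlying_source_ids": {
                        "terms": {
                            "field": "sourceId",
                            "size": sources_per_capture,
                        }
                    }
                },
            }
        },
    }


def unique_source_ids(response: Dict[str, Any]) -> List[int]:
    """Collect each sourceId once, in first-seen order, from the response."""
    seen = set()
    ordered: List[int] = []
    buckets = response["aggregations"]["by_duplicate_capture"]["buckets"]
    for capture_bucket in buckets:
        for source_bucket in capture_bucket["by_underlying_source_ids"]["buckets"]:
            source_id = source_bucket["key"]
            if source_id not in seen:
                seen.add(source_id)
                ordered.append(source_id)
    return ordered


# Hypothetical response fragment, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "by_duplicate_capture": {
            "buckets": [
                {"key": 101, "doc_count": 3,
                 "by_underlying_source_ids": {"buckets": [{"key": 7, "doc_count": 2}]}},
                {"key": 102, "doc_count": 2,
                 "by_underlying_source_ids": {"buckets": [{"key": 7, "doc_count": 2}]}},
                {"key": 103, "doc_count": 2,
                 "by_underlying_source_ids": {"buckets": [{"key": 9, "doc_count": 1}]}},
            ]
        }
    }
}

print(unique_source_ids(sample_response))  # [7, 9]
```

The body returned by build_duplicate_query() can be sent with whichever client you already use; the parsing function only relies on the aggregation names used in the query above.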
Upvotes: 1