Reputation: 25
Let's say I have some data in Elasticsearch and I would like to retrieve all of the records where the value of a particular field occurs more than once. For example:
{id:1, name: "bob", "age":30}
{id:2, name: "mike", "age":20}
{id:3, name: "bob", "age":30}
{id:4, name: "sarah", "age":40}
{id:5, name: "mike", "age":35}
I want a query that returns the records whose name occurs more than once. So it should return the following records:
{id:1, name: "bob", "age":30}
{id:2, name: "mike", "age":20}
{id:3, name: "bob", "age":30}
{id:5, name: "mike", "age":35}
So id: 4 is excluded since the name 'sarah' only occurs in one doc. An even better result would be something like:
{"name": "bob", "count":2}
{"name": "mike", "count":2}
but I can work with the first form if it's easier.
Upvotes: 1
Views: 1139
Reputation: 1286
You can use what is called Aggregations in Elasticsearch. If you're just looking for duplicate names, you can use a Terms Aggregation.
Here's an example. You can set up your data like this:
PUT testing/_doc/1
{
"name": "bob",
"age": 30
}
PUT testing/_doc/2
{
"name": "mike",
"age": 20
}
PUT testing/_doc/3
{
"name": "bob",
"age": 30
}
PUT testing/_doc/4
{
"name": "sarah",
"age": 40
}
PUT testing/_doc/5
{
"name": "mike",
"age": 35
}
Then run your aggregation:
GET testing/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"duplicates": {
"terms": {
"field": "name.keyword",
"min_doc_count": 2
}
}
}
}
This will give you a response like this:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"duplicates": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bob",
"doc_count": 2
},
{
"key": "mike",
"doc_count": 2
}
]
}
}
}
The important part is the aggregations.duplicates.buckets array, where the "name" is shown in the "key" and the number of matching documents in the "doc_count".
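If you want the exact {"name": ..., "count": ...} shape from the question, you can reshape the buckets on the client side. Here is a minimal Python sketch, assuming the response has already been parsed into a dict; the helper names duplicate_counts and documents_query are made up for illustration and are not part of any Elasticsearch client. The second helper builds a follow-up terms query in case you also want the full documents (the first form in the question).

```python
# Sample response, trimmed to the parts used here (values copied from the answer).
response = {
    "aggregations": {
        "duplicates": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "bob", "doc_count": 2},
                {"key": "mike", "doc_count": 2},
            ],
        }
    }
}


def duplicate_counts(resp, agg_name="duplicates"):
    """Turn each terms-aggregation bucket into {"name": ..., "count": ...}.

    Hypothetical helper: agg_name must match the name used in the "aggs" block.
    """
    buckets = resp["aggregations"][agg_name]["buckets"]
    return [{"name": b["key"], "count": b["doc_count"]} for b in buckets]


def documents_query(names, field="name.keyword"):
    """Build a search body matching every document whose field is one of names.

    Hypothetical helper: the body would be sent to the index's _search endpoint.
    """
    return {"query": {"terms": {field: list(names)}}}


counts = duplicate_counts(response)
print(counts)
# [{'name': 'bob', 'count': 2}, {'name': 'mike', 'count': 2}]

follow_up = documents_query(c["name"] for c in counts)
print(follow_up)
# {'query': {'terms': {'name.keyword': ['bob', 'mike']}}}
```

Keep in mind that a terms aggregation only returns the top 10 buckets by default, so if you expect more duplicated names than that, set a larger "size" inside the "terms" block.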
Upvotes: 2