Reputation: 142
Assuming I have these two values:
{"name": "foo", "value": 0}
{"name": "foo", "value": 7}
I only want the last document added to the index if two or more documents have the same name, not all of them, i.e in SQL : SELECT DISTINCT name FROM test_data
I already tried several things, such as:
{
"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "name.keyword",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {}
}
}
}
}
}
But it returns
"aggregations": {
"duplicateCount": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "foo",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": 1.0,
"hits": [
{
"_index": "test_data",
"_type": "doc",
"_id": "VYHNtmQB8mCEn5EB8msO",
"_score": 1.0,
"_source": {
"name": "foo",
"value": 7
}
},
{
"_index": "test_data",
"_type": "doc",
"_id": "VIHNtmQB8mCEn5EB5Wum",
"_score": 1.0,
"_source": {
"name": "foo",
"value": 2
}
}
]
}
}
}
]
}
}
Other than that, when I search for a solution, I only find "how to get a list of distinct values" or "how to count how many distinct values", which is not what I am looking for.
If there is no Elasticsearch-related solution, I was thinking of looping in the results to check if there's already a result with the same name, but that's time consuming. Any ideas ?
Upvotes: 1
Views: 50
Reputation: 6076
You can try using size
parameter of the top_hits
aggregation. Consider this query:
POST /my_top_hits/doc/_search
{
"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "name.keyword",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size": 1
}
}
}
}
}
}
Which will give you exactly one hit in the top_hits
section:
{
...
"aggregations": {
"duplicateCount": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "foo",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_top_hits",
"_type": "doc",
"_id": "_AfNuGQBW4b-XxcaDVib",
"_score": 1,
"_source": {
"name": "foo",
"value": 0
}
}
]
}
}
}
]
}
}
}
You can try using sort
parameter of the top_hits
aggregation.
Assuming that value
is a serial version number (i.e. the bigger it is the newer the document):
POST /my_top_hits/doc/_search
{
"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "name.keyword",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size": 1,
"sort": [
{"value": "desc"}
]
}
}
}
}
}
}
This will return you again only one document, but different from the one in the previous example:
"aggregations": {
"duplicateCount": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "foo",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "my_top_hits",
"_type": "doc",
"_id": "_QfNuGQBW4b-XxcaOFjC",
"_score": null,
"_source": {
"name": "foo",
"value": 7
},
"sort": [
7
]
}
]
}
}
}
]
}
}
If you don't have a field to sort on you would have to add one: Elasticsearch does not have such facility. It used to have a _timestamp
field but it was deprecated long ago.
"min_doc_count"
to 1?Not really, "min_doc_count"
param of the terms
aggregation is the default value.
Hope that helps!
Upvotes: 2