Prototype
Prototype

Reputation: 142

Elasticsearch: Returning only DISTINCT hits based on a namefield

Assuming I have these two values:

{"name": "foo", "value": 0} 
{"name": "foo", "value": 7} 

I only want the last document added to the index if two or more documents have the same name, not all of them, i.e in SQL : SELECT DISTINCT name FROM test_data

I already tried several things, such as:

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
                "field": "name.keyword",
                "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

But it returns

"aggregations": {
        "duplicateCount": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "total": 2,
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VYHNtmQB8mCEn5EB8msO",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 7
                                    }
                                },
                                {
                                    "_index": "test_data",
                                    "_type": "doc",
                                    "_id": "VIHNtmQB8mCEn5EB5Wum",
                                    "_score": 1.0,
                                    "_source": {
                                        "name": "foo",
                                        "value": 2
                                    }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }

Other than that, when I search for a solution, I only find "how to get a list of distinct values" or "how to count how many distinct values", which is not what I am looking for.

If there is no Elasticsearch-related solution, I was thinking of looping in the results to check if there's already a result with the same name, but that's time consuming. Any ideas ?

Upvotes: 1

Views: 50

Answers (1)

Nikolay Vasiliev
Nikolay Vasiliev

Reputation: 6076

You can try using size parameter of the top_hits aggregation. Consider this query:

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
                "field": "name.keyword",
                "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

Which will give you exactly one hit in the top_hits section:

{
...
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my_top_hits",
                  "_type": "doc",
                  "_id": "_AfNuGQBW4b-XxcaDVib",
                  "_score": 1,
                  "_source": {
                    "name": "foo",
                    "value": 0
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

How to return the last document added to the index?

You can try using sort parameter of the top_hits aggregation. Assuming that value is a serial version number (i.e. the bigger it is the newer the document):

POST /my_top_hits/doc/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
                "field": "name.keyword",
                "min_doc_count": 1
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 1,
            "sort": [
              {"value": "desc"}
            ]
          }
        }
      }
    }
  }
}

This will return you again only one document, but different from the one in the previous example:

  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "my_top_hits",
                  "_type": "doc",
                  "_id": "_QfNuGQBW4b-XxcaOFjC",
                  "_score": null,
                  "_source": {
                    "name": "foo",
                    "value": 7
                  },
                  "sort": [
                    7
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }

If you don't have a field to sort on you would have to add one: Elasticsearch does not have such facility. It used to have a _timestamp field but it was deprecated long ago.

Do I need to set that "min_doc_count" to 1?

Not really, "min_doc_count" param of the terms aggregation is the default value.

Hope that helps!

Upvotes: 2

Related Questions