dlebech

Reputation: 1839

Count how often duplicates occur

In Elasticsearch, I am trying to count the number of distinct field values in the dataset where the field value appears exactly once, and the number where it appears twice or more.

In a sense, I am trying to count how often duplicates occur. How can I do this?

Example

Let's say I have the following Elasticsearch documents:

{ "myfield": "bob" }
{ "myfield": "bob" }
{ "myfield": "alice" }
{ "myfield": "eve" }
{ "myfield": "mallory" }

Since "alice", "eve" and "mallory" appear once, and "bob" appears twice, I would expect:

number_of_values_that_appear_once: 3
number_of_values_that_appear_twice_or_more: 1

I can get part of the way with a terms aggregation by looking at the doc_count of each bucket. The output of a terms aggregation on myfield would look something like:

"buckets": [
  {
    "key": "bob",
    "doc_count": 2
  },
  {
    "key": "alice",
    "doc_count": 1
  },
  ...
]

From this output, I could count the buckets where doc_count == 1, for example. But this does not scale: I often have many thousands of distinct values, so the bucket list would be enormous.
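For reference, the client-side counting described above could be sketched in Python roughly as follows. The function name summarize_buckets is hypothetical, and the bucket list is hand-written to match the example documents rather than taken from a real response.

```python
def summarize_buckets(buckets):
    """Count distinct values that appear once vs. twice or more.

    `buckets` is assumed to be the "buckets" list of a terms
    aggregation response: dicts with "key" and "doc_count".
    """
    once = sum(1 for b in buckets if b["doc_count"] == 1)
    twice_or_more = sum(1 for b in buckets if b["doc_count"] >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": twice_or_more,
    }


# Hand-written buckets matching the five example documents.
example_buckets = [
    {"key": "bob", "doc_count": 2},
    {"key": "alice", "doc_count": 1},
    {"key": "eve", "doc_count": 1},
    {"key": "mallory", "doc_count": 1},
]

summary = summarize_buckets(example_buckets)
print(summary)
```

This works for small fields, but it needs every bucket transferred to the client, which is exactly the scaling problem.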

Upvotes: 5

Views: 2056

Answers (2)

Pratik Patil

Reputation: 107

You can count duplicates with a scripted_metric-based solution. A similar solution is explained in the article "Accurate Distinct Count and Values from Elasticsearch". All you need to do is modify the query from that article so that it counts the occurrences of each unique value instead of counting the unique values themselves.
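To illustrate the idea without reproducing the article's query, here is a minimal Python simulation of how the per-shard map phase and the final reduce phase of such a scripted_metric could fit together. It is not Painless and does not call Elasticsearch; the function names and the two-shard split of the example documents are assumptions made for illustration.

```python
from collections import Counter


def map_phase(shard_docs, field):
    """Per-shard step: count occurrences of each value of `field`."""
    counts = Counter()
    for doc in shard_docs:
        counts[doc[field]] += 1
    return counts


def reduce_phase(shard_states):
    """Final step: merge per-shard counts, then tally by frequency."""
    total = Counter()
    for state in shard_states:
        total.update(state)  # adds counts for values seen on several shards
    once = sum(1 for c in total.values() if c == 1)
    twice_or_more = sum(1 for c in total.values() if c >= 2)
    return {"once": once, "twice_or_more": twice_or_more}


# The example documents, split arbitrarily across two pretend shards.
shards = [
    [{"myfield": "bob"}, {"myfield": "alice"}, {"myfield": "eve"}],
    [{"myfield": "bob"}, {"myfield": "mallory"}],
]

result = reduce_phase([map_phase(s, "myfield") for s in shards])
print(result)
```

Note that each shard still builds a map holding all of its distinct values, so memory use grows with the number of distinct values; only the size of the final response shrinks.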

Upvotes: 1

Igor Belo

Reputation: 738

Aggregations are computed over the documents matched by your query, so to find duplicates across the whole index, just run the query below:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "YOUR_AGGREGATION_NAME": {
      "terms": {
        "field": "myfield"
      }
    }
  }
}

ps1: Setting the size key to 0 omits the hits from the response (the total is still returned).

ps2: The query key matches all the documents in the index.
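As a rough sketch, the request above could be built in Python so that only duplicated values come back. The helper name build_duplicates_query and its default values are hypothetical. The min_doc_count and size options of the terms aggregation are standard parameters, and they matter here because a terms aggregation returns only the top 10 buckets by default.

```python
import json


def build_duplicates_query(field, min_doc_count=2, max_buckets=1000):
    """Build a search body whose terms buckets contain only values
    that occur at least `min_doc_count` times (hypothetical helper)."""
    return {
        "size": 0,  # omit hits; only aggregation results are needed
        "query": {"match_all": {}},
        "aggregations": {
            "duplicates": {
                "terms": {
                    "field": field,
                    "min_doc_count": min_doc_count,  # drop values seen once
                    "size": max_buckets,  # upper bound on returned buckets
                }
            }
        },
    }


body = build_duplicates_query("myfield")
print(json.dumps(body, indent=2))
```

The number of buckets returned is then the number of values that appear twice or more, provided it stays below max_buckets.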

Upvotes: 0
