Elasticsearch aggregation and filters

Question

Hi friends, I am trying to build a search bar for my website. I have thousands of company articles. When I run this query:

GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "assistant",
            "fields": ["title"]
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company.keyword",
        "size": 10
      }
    }
  }
}

The result is:

"aggregations": {
"by_company": {
  "doc_count_error_upper_bound": 5,
  "sum_other_doc_count": 409,
  "buckets": [
    {
      "key": "University of Miami",
      "doc_count": 6
    },
    {
      "key": "Brigham & Women's Hospital(BWH)",
      "doc_count": 4
    },

So now I want to filter down to the University of Miami articles, so I run the following query:

GET indeed_psql/job/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "assistant",
            "fields": ["title"]
          }
        }
      ],
      "filter": {
        "term": {
          "company.keyword": "University of Miami"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company.keyword",
        "size": 10
      }
    }
  }
}

But now the result is:

"aggregations": {
    "by_company": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "University of Miami",
          "doc_count": 7
        }
      ]
    }

Why are there suddenly seven of them when the previous aggregation showed 6? This also happens with filters for other universities. What am I doing wrong? I am not using the standard tokenizer, and for filters I use english_stemmer, english_stopwords, and english_keywords. Thanks for your help.

yyssw · Accepted Answer

It's likely that the document counts from your first query are wrong. In your first response, "doc_count_error_upper_bound" is 5, which means that some of the terms in the returned aggregation were not among the top results on every underlying shard. Such a count can only be too low, never too high: each shard reports only its top N keys, so documents for a term that falls outside a shard's top N are "missed" when the shard results are merged.

How many shards do you have? For instance, suppose there are 3 shards, your aggregation size is 3, and your documents are distributed something like this:

Shard 1      Shard 2     Shard 3
3 BYU        3 UMiami    3 UMiami
2 UMich      2 BWH       2 UMich
2 MGH        2 UMich     1 BWH
1 UMiami     1 MGH       1 BYU

Your resulting top 3 terms from each shard are merged into:

6 UMiami // returned
6 UMich // returned
3 BWH // returned
3 BYU
2 MGH

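Of these, only the top three are returned. Several keys are undercounted: the true totals are 7 for UMiami, 4 for BYU, and 3 for MGH, because each of them has a document that fell outside one shard's top 3.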

You can see in this scenario, the UMiami document in Shard 1 would not make it into consideration because it is beyond the depth of 3. But if you filter to ONLY look at UMiami, you would necessarily pull back any associated docs in each shard and end up with an accurate count.
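To make the merge concrete, here is a small Python sketch that replays the example table. It is only a simulation of the idea described above, not Elasticsearch's actual implementation. The names SHARDS, top_terms, merged_counts, and true_counts are made up for illustration, and breaking ties alphabetically is an assumption.

```python
from collections import Counter

# Per-shard document counts copied from the example table above.
SHARDS = [
    {"BYU": 3, "UMich": 2, "MGH": 2, "UMiami": 1},   # Shard 1
    {"UMiami": 3, "BWH": 2, "UMich": 2, "MGH": 1},   # Shard 2
    {"UMiami": 3, "UMich": 2, "BWH": 1, "BYU": 1},   # Shard 3
]


def top_terms(shard, n):
    """Return the n most frequent (term, count) pairs on one shard."""
    # Sort by count descending, then by term (assumed tie-break).
    return sorted(shard.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def merged_counts(shards, per_shard):
    """Sum only the counts that each shard reports in its top `per_shard`."""
    merged = Counter()
    for shard in shards:
        for term, count in top_terms(shard, per_shard):
            merged[term] += count
    return merged


def true_counts(shards):
    """Sum every count on every shard (what a filtered query effectively sees)."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total


approx = merged_counts(SHARDS, per_shard=3)
exact = true_counts(SHARDS)

for term, count in sorted(approx.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term}: merged={count} true={exact[term]}")
```

With per_shard=3 the merged UMiami count is 6 while the true total is 7. Asking each shard for 4 terms instead makes the merged counts exact for this toy data.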

You can play around with the shard_size parameter so that Elasticsearch looks a little deeper into each shard to get a more accurate count. But given that there are only 7 documents in total for this facet, it's likely that one of your shards holds just a single occurrence of it, so it will be hard to surface it in that shard's top terms without grabbing all of the documents for that shard.
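As a rough sketch of what that could look like, the snippet below builds the first request body from the question as a Python dict, adds shard_size to the terms aggregation, and prints the JSON to send to articles/_search. The value 100 is an arbitrary illustration, not a recommendation. Larger values cost more memory and network traffic per shard, so it is worth tuning against your own data.

```python
import json

# The question's first request body, with "shard_size" added to the
# terms aggregation. 100 is an illustrative value only.
request_body = {
    "query": {
        "bool": {
            "must": [
                {"multi_match": {"query": "assistant", "fields": ["title"]}}
            ]
        }
    },
    "size": 0,
    "aggs": {
        "by_company": {
            "terms": {
                "field": "company.keyword",
                "size": 10,        # number of buckets returned to the client
                "shard_size": 100  # number of candidate terms fetched per shard
            }
        }
    },
}

# Print the body so it can be pasted into a console request.
print(json.dumps(request_body, indent=2))
```

If "doc_count_error_upper_bound" drops to 0 in the response, the counts of the returned buckets should match what the filtered query reports.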

You can read more about the count approximation and how the error is derived in the Elasticsearch documentation for the terms aggregation. TL;DR: Elasticsearch is estimating the total number of documents for each facet from the top terms returned by each individual shard.
