Chandra

Reputation: 1607

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings and mapping I use.

{
    "index": {
        "analysis": {
            "analyzer": {
                "prefix-test-analyzer": {
                    "filter": "dotted",
                    "tokenizer": "prefix-test-tokenizer",
                    "type": "custom"
                }
            },
            "filter": {
                "dotted": {
                    "patterns": [
                        "([^.]+)"
                    ],
                    "type": "pattern_capture"
                }
            },
            "tokenizer": {
                "prefix-test-tokenizer": {
                    "delimiter": ".",
                    "type": "path_hierarchy"
                }
            }
        }
    }
}

{
    "metrics": {
        "_routing": {
            "required": true
        },
        "properties": {
            "tenantId": {
                "type": "string",
                "index": "not_analyzed"
            },
            "unit": {
                "type": "string",
                "index": "not_analyzed"
            },
            "metric_name": {
                "index_analyzer": "prefix-test-analyzer",
                "search_analyzer": "keyword",
                "type": "string"
            }
        }
    }
}
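
(For context, a hypothetical document might then be indexed like this, with the tenantId doubling as the routing value just as in the search below; the "ms" unit value is made up:)

curl -XPOST 'http://localhost:9200/metrics_alias/metrics?routing=12345' -d '{
    "tenantId": "12345",
    "unit": "ms",
    "metric_name": "foo.bar.baz"
}'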

The above index creates the following terms for a metric name foo.bar.baz

foo
bar
baz
foo.bar
foo.bar.baz
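
(This can be sanity-checked against the index with the _analyze API, e.g. on ES 1.x:)

curl -XGET 'http://localhost:9200/metrics_alias/_analyze?analyzer=prefix-test-analyzer&pretty' -d 'foo.bar.baz'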

If I have a bunch of metrics, like the ones below

a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z

I have to write a query to grab the nth level of tokens. In the example above

for level = 0, I should get [a, x] 
for level = 1, with 'a' as first token I should get [b]
               with 'x' as first token I should get [y]  
for level = 2, with 'a.b' as first token I should get [c, m]

I couldn't think of any way other than to write a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with.

time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
      "size": 0,
      "query": {
        "term": {
            "tenantId": "12345"
        }
      },
      "aggs": {
          "metric_name_tokens": {
              "terms": {
                  "field" : "metric_name",
                  "include": "a[.]b[.][^.]*",
                  "execution_hint": "map",
                  "size": 0
              }
          }
      }
  }'

This would result in the following buckets. I parse the output and grab [c, m] from there.

"buckets" : [ {
     "key" : "a.b.c",
     "doc_count" : 2
   }, {
     "key" : "a.b.m",
     "doc_count" : 1
 } ]
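
(The same query shape should presumably cover the other levels by changing only the include regex; for example, for level = 1 under 'a', the aggs section would become:)

"aggs": {
    "metric_name_tokens": {
        "terms": {
            "field" : "metric_name",
            "include": "a[.][^.]*",
            "execution_hint": "map",
            "size": 0
        }
    }
}

and [b] gets parsed out of the resulting a.b bucket.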

So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants that have large amounts of data (around 1 million), the performance is really slow. I am guessing the terms aggregation is what takes the time.

I am wondering whether a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.

Upvotes: 1

Views: 465

Answers (1)

Andrei Stefan

Reputation: 52368

Some suggestions:

  • "mirror" the filter you have at the aggregation level in the query part as well. So, for a.b. matching, use the following as the query and keep the same aggs section:
"bool": {
  "must": [
    {
      "term": {
        "tenantId": 123
      }
    },
    {
      "prefix": {
        "metric_name": {
          "value": "a.b."
        }
      }
    }
  ]
}

or even use a regexp query with the same regular expression as in the aggregation part. In this way, the aggregation will have to evaluate fewer buckets, as fewer documents will reach the aggregation part. You mentioned that regexp works better for you; my initial guess was that prefix would perform better.

  • change "size": 0 in the aggregation to "size": 100. After testing, you mentioned this doesn't make any difference.
  • remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing, you mentioned that the default execution_hint performed far worse.
  • the only other thing I could think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (not letting ES do it) and index each element in the hierarchy in a separate field of the same document. For example, a.b in field2, a.b.c in field3, and so on. Then, at search time, you look at specific fields depending on what the search text is. A sketch of the idea is shown below. This whole idea, though, requires some additional work outside ES.
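
A minimal sketch of that last idea, with hypothetical field names field1 ... field4 (each mapped as a not_analyzed string) and the splitting done by the indexing application rather than by ES. The metric a.b.c.d would be indexed as:

curl -XPOST 'http://localhost:9200/metrics_alias/metrics?routing=12345' -d '{
    "tenantId": "12345",
    "metric_name": "a.b.c.d",
    "field1": "a",
    "field2": "a.b",
    "field3": "a.b.c",
    "field4": "a.b.c.d"
}'

and the level-2 children of a.b could then be fetched with a plain term filter plus a terms aggregation on the next field, with no regex involved:

curl -XGET 'http://localhost:9200/metrics_alias/metrics/_search?pretty&routing=12345' -d '{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                { "term": { "tenantId": "12345" } },
                { "term": { "field2": "a.b" } }
            ]
        }
    },
    "aggs": {
        "metric_name_tokens": {
            "terms": { "field": "field3", "size": 0 }
        }
    }
}'

which would give the same a.b.c / a.b.m buckets as before.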

Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.

Upvotes: 2
