Reputation: 1607
I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings and mapping I use.
{
  "index": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "filter": "dotted",
          "tokenizer": "prefix-test-tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "dotted": {
          "patterns": [
            "([^.]+)"
          ],
          "type": "pattern_capture"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "delimiter": ".",
          "type": "path_hierarchy"
        }
      }
    }
  }
}
{
  "metrics": {
    "_routing": {
      "required": true
    },
    "properties": {
      "tenantId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "unit": {
        "type": "string",
        "index": "not_analyzed"
      },
      "metric_name": {
        "index_analyzer": "prefix-test-analyzer",
        "search_analyzer": "keyword",
        "type": "string"
      }
    }
  }
}
The above index creates the following terms for the metric name foo.bar.baz:
foo
bar
baz
foo.bar
foo.bar.baz
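To make the term list above easier to check, here is a rough pure-Python emulation of the analyzer chain. It is only a sketch, not Elasticsearch code: the function names are made up for illustration, and it assumes the pattern_capture filter keeps the original token alongside the captured groups (its preserve_original setting left at the default).

```python
import re


def path_hierarchy(text, delimiter="."):
    """Emit each cumulative prefix: 'a.b.c' -> ['a', 'a.b', 'a.b.c']."""
    parts = text.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


def pattern_capture(token, pattern=r"([^.]+)"):
    """Emit the original token plus every captured group (assumed behaviour)."""
    return [token] + re.findall(pattern, token)


def index_terms(metric_name):
    """Distinct terms the emulated chain would index for one metric name."""
    terms = set()
    for token in path_hierarchy(metric_name):
        terms.update(pattern_capture(token))
    return terms


print(sorted(index_terms("foo.bar.baz")))
# ['bar', 'baz', 'foo', 'foo.bar', 'foo.bar.baz']
```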
If I have a bunch of metrics like the ones below:
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query that grabs the nth level of tokens. In the example above:
for level = 0, I should get [a, x]
for level = 1, with 'a' as the first token I should get [b]
               with 'x' as the first token I should get [y]
for level = 2, with 'a.b' as the leading tokens I should get [c, m]
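The expected results can be stated precisely with a small in-memory reference function. This is only a sketch of the desired semantics in plain Python, not a replacement for the Elasticsearch query; tokens_at_level is a hypothetical helper name.

```python
def tokens_at_level(metric_names, level, prefix=""):
    """Return the sorted distinct tokens at `level` among names under `prefix`."""
    prefix_parts = prefix.split(".") if prefix else []
    if len(prefix_parts) != level:
        raise ValueError("prefix must contain exactly `level` tokens")
    found = set()
    for name in metric_names:
        parts = name.split(".")
        # Keep names that are deep enough and start with the requested prefix.
        if len(parts) > level and parts[:level] == prefix_parts:
            found.add(parts[level])
    return sorted(found)


metrics = ["a.b.c.d.e", "a.b.c.d", "a.b.m.n", "x.y.z"]
print(tokens_at_level(metrics, 0))         # ['a', 'x']
print(tokens_at_level(metrics, 1, "a"))    # ['b']
print(tokens_at_level(metrics, 1, "x"))    # ['y']
print(tokens_at_level(metrics, 2, "a.b"))  # ['c', 'm']
```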
I couldn't think of any way to do this other than a terms aggregation. To figure out the level 2 tokens of a.b, here is the query I came up with.
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
  "size": 0,
  "query": {
    "term": {
      "tenantId": "12345"
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": {
        "field": "metric_name",
        "include": "a[.]b[.][^.]*",
        "execution_hint": "map",
        "size": 0
      }
    }
  }
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
  "key" : "a.b.c",
  "doc_count" : 2
}, {
  "key" : "a.b.m",
  "doc_count" : 1
} ]
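The client-side part of this (building the include pattern for a prefix and pulling the last token out of each bucket key) could look roughly like the following. Both helper names are hypothetical, and the pattern builder assumes tokens contain no regex metacharacters; it also does not cover level 0, where there is no prefix.

```python
def include_pattern(prefix):
    """Regex for terms exactly one level below `prefix`, e.g. 'a.b' -> 'a[.]b[.][^.]*'."""
    # Assumes tokens are plain names without regex metacharacters.
    return "[.]".join(prefix.split(".")) + "[.][^.]*"


def next_level_tokens(buckets, prefix):
    """Strip '<prefix>.' from each bucket key and return what remains."""
    head = prefix + "."
    return [b["key"][len(head):] for b in buckets if b["key"].startswith(head)]


buckets = [
    {"key": "a.b.c", "doc_count": 2},
    {"key": "a.b.m", "doc_count": 1},
]
print(include_pattern("a.b"))             # a[.]b[.][^.]*
print(next_level_tokens(buckets, "a.b"))  # ['c', 'm']
```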
So far so good. The query works well for most tenants (notice the tenantId term query above). For certain tenants that have large amounts of data (around 1 million), performance is really slow. I am guessing the terms aggregation is what takes the time.
I am wondering whether a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.
Upvotes: 1
Views: 465
Reputation: 52368
Some suggestions:
- To limit the aggregation to documents matching a.b., use the following as the query and keep the same aggs section:
"bool": {
  "must": [
    {
      "term": {
        "tenantId": 123
      }
    },
    {
      "prefix": {
        "metric_name": {
          "value": "a.b."
        }
      }
    }
  ]
}
or even use a regexp query with the same regular expression as in the aggregation part. This way the aggregation has fewer buckets to evaluate, because fewer documents reach the aggregation stage. You mentioned that regexp is working better for you; my initial guess was that prefix would perform better.
- Change "size": 0 in the aggregation to "size": 100. After testing, you mentioned this doesn't make any difference.
- Remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing, you mentioned that the default execution_hint was performing far worse.
- Index each prefix level in its own field: a.b in field2, a.b.c in field3, and so on, all for the same document. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES.

Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.
Upvotes: 2