Reputation: 2599
We have 60M documents in an index hosted on a 4-node cluster.
I want to make sure the configuration is optimised for aggregations on the documents.
This is the sample query:
select * from sources * where (sddocname contains ([{"implicitTransforms": false}]"tweet")) | all(group(n_tA_c) each(output(count() as(count))));
The field n_tA_c contains an array of strings. This is a sample document:
{
"fields": {
"add_gsOrd": 63829,
"documentid": "id:firehose:tweet::815347045032742912",
"foC": 467,
"frC": 315,
"g": 0,
"ln": "en",
"m": "ya just wants some fried rice",
"mTp": 2,
"n_c_p": [],
"n_tA_c": [
"fried",
"rice"
],
"n_tA_s": [],
"n_tA_tC": [],
"sN": "long_delaney1",
"sT_dlC": 0,
"sT_fC": 0,
"sT_lAT": 0,
"sT_qC": 0,
"sT_r": 0.0,
"sT_rC": 467,
"sT_rpC": 0,
"sT_rtC": 0,
"sT_vC": 0,
"sddocname": "tweet",
"t": 1483228858608,
"u": 377606303,
"v": "false"
},
"id": "id:firehose:tweet::815347045032742912",
"relevance": 0.0,
"source": "content-root-cluster"
}
The n_tA_c field is an attribute with fast-search mode:
field n_tA_c type array<string> {
indexing: summary | attribute
attribute: fast-search
}
This simple term aggregation query does not come back within 20 s and times out. What additional checklist items should we verify to reduce this latency?
$ curl 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20(sddocname%20contains%20(%5B%7B%22implicitTransforms%22%3A%20false%7D%5D%22tweet%22))%20%7C%20all(group(n_tA_c)%20each(output(count()%20as(count))))%3B' | python -m json.tool
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 270 100 270 0 0 13 0 0:00:20 0:00:20 --:--:-- 67
{
"root": {
"children": [
{
"continuation": {
"this": ""
},
"id": "group:root:0",
"relevance": 1.0
}
],
"errors": [
{
"code": 12,
"message": "Timeout while waiting for sc0.num0",
"source": "content-root-cluster",
"summary": "Timed out"
}
],
"fields": {
"totalCount": 0
},
"id": "toplevel",
"relevance": 1.0
}
}
These nodes are AWS i3.4xlarge boxes (16 cores, 120 GB).
I might be missing something silly.
Upvotes: 3
Views: 743
Reputation: 2599
Summarising the checkpoints to take care of when running aggregations, from the conversation in the other answer and further documentation:
- Use max(x) in the group to set the number of buckets needed. When data is distributed across multiple content nodes, the result can be inaccurate; add precision(x) as well to tune the accuracy as needed.
- Use limit 0 in the YQL; this skips the step of loading summaries to return to the container.
- Set fast-search on the attribute; otherwise it has no B-tree-like index and has to be traversed.
- Add &ranking=unranked to the query.
- Set max-hits as described: http://docs.vespa.ai/documentation/performance/sizing-search.html
- Set <persearch>16</persearch>; threads persearch is 1 by default.
These changes ensured that the query returned a result before the timeout. But I learned that aggregation is not Vespa's primary goal. Write and search latencies are much lower than with ES at the same scale on identical hardware, but aggregation (especially over multi-valued string fields) is more CPU intensive and has higher latency than ES for the same aggregation query.
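For reference, the query-side items in the checklist above (max, precision, limit 0, unranked) can be combined into one request. This is only a sketch in Python: the helper name build_grouping_url and the precision value of 1000 are my own placeholders, the where clause is simplified from the one in the question, and I have not benchmarked these exact values.

```python
from urllib.parse import quote, urlencode


def build_grouping_url(host, field, max_groups, precision):
    """Build a search URL for a grouping-only query (hypothetical helper)."""
    yql = (
        # "limit 0" asks for no hits, so no summaries need to be fetched.
        'select * from sources * where sddocname contains "tweet" '
        # max() caps the number of groups; precision() tunes accuracy.
        f"limit 0 | all(group({field}) max({max_groups}) precision({precision}) "
        "each(output(count() as(count))));"
    )
    # ranking=unranked skips relevance computation for the matched documents.
    params = {"yql": yql, "ranking": "unranked"}
    return f"{host}/search/?{urlencode(params, quote_via=quote)}"


# Values below are illustrative, not tuned recommendations.
url = build_grouping_url("http://localhost:8080", "n_tA_c", 10, 1000)
print(url)
```

The resulting URL can be passed to curl exactly like the one in the question.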
Upvotes: 1
Reputation: 3184
You are asking for every unique value and its count(), because your grouping expression does not contain any max(x) limit. This is a very CPU- and network-intensive task to compute; limiting the number of groups is much faster, e.g.:
all(group(n_tA_c) max(10) each(output(count() as(count))));
General comments: With Vespa, as with any other serving engine, it's important to have enough memory and, for example, swap disabled, so that you can index and search data without running into high memory pressure.
How much memory you use per document type depends on several factors, but the number of fields defined as attributes and the number of documents per node are important. Redundancy and the number of searchable copies also play a major role.
Grouping over the entire corpus is memory intensive (memory bandwidth for reading attribute values), CPU intensive, and also network intensive when there is a high fan-out. See http://docs.vespa.ai/documentation/grouping.html for more on precision, which can limit the number of groups returned per node.
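To illustrate why limiting the number of groups returned per node trades accuracy for speed, here is a toy model in Python. It is not Vespa's actual merge logic: the function merged_top_groups and the per-node counts are made up, and it assumes each node returns only its own top groups before the results are merged.

```python
from collections import Counter


def merged_top_groups(node_counts, per_node_limit, top_n):
    """Merge each node's top `per_node_limit` groups, then keep the top_n overall."""
    merged = Counter()
    for counts in node_counts:
        # Each node only ships its locally largest groups.
        for term, count in Counter(counts).most_common(per_node_limit):
            merged[term] += count
    return merged.most_common(top_n)


# Hypothetical per-node term counts: "tea" is second on every node
# but the most frequent term overall.
nodes = [
    {"rice": 5, "tea": 4},
    {"soup": 5, "tea": 4},
    {"cake": 5, "tea": 4},
]

exact = merged_top_groups(nodes, per_node_limit=2, top_n=1)
truncated = merged_top_groups(nodes, per_node_limit=1, top_n=1)
print("all groups per node:", exact)
print("one group per node: ", truncated)
```

With one group per node, the globally most frequent term never reaches the merge step, which is the kind of inaccuracy a higher per-node limit is meant to reduce, at the cost of more data sent over the network.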
Upvotes: 6