enator

Reputation: 2599

How to make aggregations fast on Vespa?

We have 60M documents in an index, hosted on a 4-node cluster.

I want to make sure the configuration is optimised for aggregations on the documents.

This is the sample query:

select * from sources * where (sddocname contains ([{"implicitTransforms": false}]"tweet")) | all(group(n_tA_c) each(output(count() as(count))));

The field n_tA_c contains an array of strings. This is a sample document:

        {
            "fields": {
                "add_gsOrd": 63829,
                "documentid": "id:firehose:tweet::815347045032742912",
                "foC": 467,
                "frC": 315,
                "g": 0,
                "ln": "en",
                "m": "ya just wants some fried rice",
                "mTp": 2,
                "n_c_p": [],
                "n_tA_c": [                        
                    "fried",
                    "rice"
                ],
                "n_tA_s": [],
                "n_tA_tC": [],
                "sN": "long_delaney1",
                "sT_dlC": 0,
                "sT_fC": 0,
                "sT_lAT": 0,
                "sT_qC": 0,
                "sT_r": 0.0,
                "sT_rC": 467,
                "sT_rpC": 0,
                "sT_rtC": 0,
                "sT_vC": 0,
                "sddocname": "tweet",
                "t": 1483228858608,
                "u": 377606303,
                "v": "false"
            },
            "id": "id:firehose:tweet::815347045032742912",
            "relevance": 0.0,
            "source": "content-root-cluster"
        }

The n_tA_c field is an attribute with fast-search enabled:

    field n_tA_c type array<string> {
        indexing: summary | attribute
        attribute: fast-search
    }

This simple term aggregation query does not come back within 20s and times out. What additional checks should we go through to reduce this latency?

$ curl 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20(sddocname%20contains%20(%5B%7B%22implicitTransforms%22%3A%20false%7D%5D%22tweet%22))%20%7C%20all(group(n_tA_c)%20each(output(count()%20as(count))))%3B' | python -m json.tool
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   270  100   270    0     0     13      0  0:00:20  0:00:20 --:--:--    67
    {
        "root": {
            "children": [
                {
                    "continuation": {
                        "this": ""
                    },
                    "id": "group:root:0",
                    "relevance": 1.0
                }
            ],
            "errors": [
                {
                    "code": 12,
                    "message": "Timeout while waiting for sc0.num0",
                    "source": "content-root-cluster",
                    "summary": "Timed out"
                }
            ],
            "fields": {
                "totalCount": 0
            },
            "id": "toplevel",
            "relevance": 1.0
        }
    }

These nodes are AWS i3.4xlarge boxes (16 cores, 120 GB RAM).

I might be missing something silly.

Upvotes: 3

Views: 743

Answers (2)

enator

Reputation: 2599

Summarising the checkpoints to take care of when doing aggregations, based on the conversation in the other answer and further reading of the documentation:

<persearch>16</persearch>

Threads per search is 1 by default.
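
For reference, this is roughly where that tuning element lives in services.xml, inside the content cluster's proton tuning section (a sketch only; the cluster id is taken from the question's output, and the element names follow the Vespa services reference, so verify against your Vespa version):

    <content id="content-root-cluster" version="1.0">
        <engine>
            <proton>
                <tuning>
                    <searchnode>
                        <requestthreads>
                            <!-- search threads used per query; the default is 1 -->
                            <persearch>16</persearch>
                        </requestthreads>
                    </searchnode>
                </tuning>
            </proton>
        </engine>
        <!-- redundancy, documents, nodes, ... as in the existing setup -->
    </content>

More threads per search reduces per-query latency at the cost of overall throughput, so 16 on a 16-core i3.4xlarge lets a single grouping query use all cores.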

The above change ensured that the query returned results before the timeout. But I learned that Vespa is not built with aggregations as its primary goal. Write and search latencies are much lower than Elasticsearch at the same scale on identical hardware, but aggregation (especially over multi-valued string fields) is more CPU intensive and has higher latency than ES for the same aggregation query.

Upvotes: 1

Jo Kristian Bergum

Reputation: 3184

You are asking for every unique value and its count(), since your grouping expression does not contain any max(x) limitation. This is a very CPU and network intensive task to compute; limiting the number of groups is much faster, e.g.

all(group(n_tA_c) max(10) each(output(count() as(count))));
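
Put together with the request from the question, the limited query could look like the following (a sketch against the same local endpoint; the implicitTransforms annotation from the original query is dropped here for readability):

    $ curl -G --data-urlencode 'yql=select * from sources * where (sddocname contains "tweet") | all(group(n_tA_c) max(10) each(output(count() as(count))));' 'http://localhost:8080/search/' | python -m json.tool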

General comments: With Vespa, like any other serving engine, it's important to have enough memory and, for example, swap disabled, so you can index and search data without getting into high memory pressure.

How much memory you'll use per document type depends on several factors, but the number of fields defined as attribute and the number of documents per node are important. Redundancy and the number of searchable copies also play a major role.

Grouping over the entire corpus is memory intensive (memory bandwidth for reading attribute values), CPU intensive, and also network intensive when there is high fan-out. See more on precision at http://docs.vespa.ai/documentation/grouping.html, which can limit the number of groups returned per node.
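
For example, a variant of the grouping expression above that both caps the number of returned groups and sets an explicit per-node precision (how many groups each content node sends back for merging) could look like the following; the values are placeholders to tune against your own accuracy/latency trade-off:

    all(group(n_tA_c) max(10) precision(1000) each(output(count() as(count))));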

Upvotes: 6
