enator

Reputation: 2599

How to make aggregations fast on Vespa?

We have 60M documents in an index, hosted on a 4-node cluster.

I want to make sure the configuration is optimised for aggregations on the documents.

This is the sample query:

select * from sources * where (sddocname contains ([{"implicitTransforms": false}]"tweet")) | all(group(n_tA_c) each(output(count() as(count))));

The field n_tA_c contains an array of strings. This is a sample document:

        {
            "fields": {
                "add_gsOrd": 63829,
                "documentid": "id:firehose:tweet::815347045032742912",
                "foC": 467,
                "frC": 315,
                "g": 0,
                "ln": "en",
                "m": "ya just wants some fried rice",
                "mTp": 2,
                "n_c_p": [],
                "n_tA_c": [                        
                    "fried",
                    "rice"
                ],
                "n_tA_s": [],
                "n_tA_tC": [],
                "sN": "long_delaney1",
                "sT_dlC": 0,
                "sT_fC": 0,
                "sT_lAT": 0,
                "sT_qC": 0,
                "sT_r": 0.0,
                "sT_rC": 467,
                "sT_rpC": 0,
                "sT_rtC": 0,
                "sT_vC": 0,
                "sddocname": "tweet",
                "t": 1483228858608,
                "u": 377606303,
                "v": "false"
            },
            "id": "id:firehose:tweet::815347045032742912",
            "relevance": 0.0,
            "source": "content-root-cluster"
        }

The n_tA_c field is an attribute with fast-search enabled:

    field n_tA_c type array<string> {
        indexing: summary | attribute
        attribute: fast-search
    }

This simple term aggregation query does not come back within 20s and times out. What additional checks should we go through to reduce this latency?

$ curl 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20(sddocname%20contains%20(%5B%7B%22implicitTransforms%22%3A%20false%7D%5D%22tweet%22))%20%7C%20all(group(n_tA_c)%20each(output(count()%20as(count))))%3B' | python -m json.tool
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   270  100   270    0     0     13      0  0:00:20  0:00:20 --:--:--    67
    {
        "root": {
            "children": [
                {
                    "continuation": {
                        "this": ""
                    },
                    "id": "group:root:0",
                    "relevance": 1.0
                }
            ],
            "errors": [
                {
                    "code": 12,
                    "message": "Timeout while waiting for sc0.num0",
                    "source": "content-root-cluster",
                    "summary": "Timed out"
                }
            ],
            "fields": {
                "totalCount": 0
            },
            "id": "toplevel",
            "relevance": 1.0
        }
    }

These nodes are AWS i3.4xlarge boxes (16 cores, 120 GB RAM).

I might be missing something silly.

Upvotes: 3

Views: 743

Answers (2)

enator

Reputation: 2599

Summarising the checkpoints to take care of when doing aggregations, based on the conversation in the other answer and further reading of the documentation:

<persearch>16</persearch>

Threads per search is 1 by default.
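
For reference, this is roughly where that tuning element lives in services.xml, inside the content cluster's proton tuning section (a sketch only; the cluster id is taken from the question's output, and the element names follow the Vespa services reference, so verify against your Vespa version):

    <content id="content-root-cluster" version="1.0">
        <engine>
            <proton>
                <tuning>
                    <searchnode>
                        <requestthreads>
                            <!-- search threads used per query; the default is 1 -->
                            <persearch>16</persearch>
                        </requestthreads>
                    </searchnode>
                </tuning>
            </proton>
        </engine>
        <!-- redundancy, documents, nodes, ... as in the existing setup -->
    </content>

More threads per search reduces per-query latency at the cost of overall throughput, so 16 on a 16-core i3.4xlarge lets a single grouping query use all cores.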

The above change ensured that the query returned results before the timeout. But I learned that Vespa is not built with aggregations as its primary goal. Write and search latencies are much lower than Elasticsearch at the same scale on identical hardware, but aggregation (especially over multi-valued string fields) is more CPU intensive and has higher latency than ES for the same aggregation query.

Upvotes: 1

Jo Kristian Bergum

Reputation: 3184

You are asking for every unique value and its count(), since your grouping expression does not contain any max(x) limitation. This is a very CPU and network intensive task to compute; limiting the number of groups is much faster, e.g.

all(group(n_tA_c) max(10) each(output(count() as(count))));
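
Put together with the request from the question, the limited query could look like the following (a sketch against the same local endpoint; the implicitTransforms annotation from the original query is dropped here for readability):

    $ curl -G --data-urlencode 'yql=select * from sources * where (sddocname contains "tweet") | all(group(n_tA_c) max(10) each(output(count() as(count))));' 'http://localhost:8080/search/' | python -m json.tool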

General comments: With Vespa, like any other serving engine, it's important to have enough memory and, for example, swap disabled, so you can index and search data without getting into high memory pressure.

How much memory you'll use per document type depends on several factors, but the number of fields defined as attribute and the number of documents per node are important. Redundancy and the number of searchable copies also play a major role.

Grouping over the entire corpus is memory intensive (memory bandwidth for reading attribute values), CPU intensive, and also network intensive when there is high fan-out. See more on precision at http://docs.vespa.ai/documentation/grouping.html, which can limit the number of groups returned per node.
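
For example, a variant of the grouping expression above that both caps the number of returned groups and sets an explicit per-node precision (how many groups each content node sends back for merging) could look like the following; the values are placeholders to tune against your own accuracy/latency trade-off:

    all(group(n_tA_c) max(10) precision(1000) each(output(count() as(count))));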

Upvotes: 6
