chaim
chaim

Reputation: 1276

Aggregate query over multiple columns (one is an array) in clickhouse

I'm trying to get aggregates values for each att1, and att2 column, and also for each value of the arrays in att3 column.

As far I tried:

create table test(value Float32, att1 String, att2 String, att3 Array(String))
 ENGINE=MergeTree() ORDER BY ();
INSERT INTO test VALUES (2.0, 'a', 'Z', ['sports', 'office', 'anothertag'])
INSERT INTO test VALUES (4.0, 'b', 'X', ['sports', 'office', 'tag'])
INSERT INTO test VALUES (6.0, 'b', 'X', ['sports', 'internet', 'planes'])
SELECT * from test;
┌─value─┬─att1─┬─att2─┬─att3───────────────────────────┐
│     6 │ b    │ X    │ ['sports','internet','planes'] │
└───────┴──────┴──────┴────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3─────────────────────────────┐
│     2 │ a    │ Z    │ ['sports','office','anothertag'] │
└───────┴──────┴──────┴──────────────────────────────────┘
┌─value─┬─att1─┬─att2─┬─att3──────────────────────┐
│     4 │ b    │ X    │ ['sports','office','tag'] │
└───────┴──────┴──────┴───────────────────────────┘

I want to get the aggregate -sum(value)- for each different attribute.

I have it working for att1 and att2 columns with:

SELECT
    att1,
    att2,
    sum(value)
FROM test
GROUP BY
    att1,
    att2
    WITH CUBE

Result:

┌─att1─┬─att2─┬─sum(value)─┐
│ b    │ X    │         10 │
│ a    │ Z    │          2 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│ a    │      │          2 │
│ b    │      │         10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│      │ Z    │          2 │
│      │ X    │         10 │
└──────┴──────┴────────────┘
┌─att1─┬─att2─┬─sum(value)─┐
│      │      │         12 │
└──────┴──────┴────────────┘

Which gives me more than needed, but results two and three give correct results.

But I also need the value for each value on att3, I have it working in another query, but when trying to make a single query:

SELECT
    att1,
    att2,
    arrayJoin(att3) AS tags,
    sum(value)
FROM test
GROUP BY
    att1,
    att2,
    tags
    WITH CUBE

Which gives (among other things):

┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a    │      │      │          6 │
│ b    │      │      │         30 │
└──────┴──────┴──────┴────────────┘

┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│      │      │ tag        │          4 │
│      │      │ anothertag │          2 │
│      │      │ planes     │          6 │
│      │      │ sports     │         12 │
│      │      │ internet   │          6 │
│      │      │ office     │          6 │
└──────┴──────┴────────────┴────────────┘

Since arrayJoin 'unfolds' array into rows, now values of sum(value) in att1 are not accurate.

I've also tried the LEFT ARRAY JOIN syntax with same results.

Updated:

The ideal result would be something like:

┌─'att1'─┬─'att2'─┬─'tags'─┬─'sum(value)'─┐
│    a   │        │        │         2    │
│    b   │        │        │         10   │
│        │    X   │        │         10   │
│        │    Z   │        │         2    │
│        │        │ sports │         12   │
│        │        │ office │         6    │
│        │        │ anot.. │         2    │
│        │        │ tag    │         4    │
│        │        │internet│         6    │
│        │        │planes  │         6    │
└────────┴────────┴────────┴──────────────┘

Could be in different rows (results), but ideally in one single query.

Upvotes: 0

Views: 3991

Answers (2)

Denny Crane
Denny Crane

Reputation: 13370

SELECT
    sumMap(([att1], [value])) AS r1,
    sumMap(([att2], [value])) AS r2,
    sumMap((att3, replicate(value, att3))) AS r3
FROM test
┌─r1─────────────────┬─r2─────────────────┬─r3──────────────────────────────────────────────────────────────────────────┐
│ (['a','b'],[2,10]) │ (['X','Z'],[10,2]) │ (['anothertag','internet','office','planes','sports','tag'],[2,6,6,6,12,4]) │
└────────────────────┴────────────────────┴─────────────────────────────────────────────────────────────────────────────┘



SELECT
    (arrayJoin(arrayZip((arrayJoin([sumMap(([att1], [value])), sumMap(([att2], [value])), sumMap((att3, replicate(value, att3)))]) AS r).1, r.2)) AS x).1 AS y,
    x.2 AS z
FROM test
┌─y──────────┬──z─┐
│ a          │  2 │
│ b          │ 10 │
│ X          │ 10 │
│ Z          │  2 │
│ anothertag │  2 │
│ internet   │  6 │
│ office     │  6 │
│ planes     │  6 │
│ sports     │ 12 │
│ tag        │  4 │
└────────────┴────┘

Upvotes: 1

vladimir
vladimir

Reputation: 15236

I think the more straightforward way is to combine two queries:

SELECT
    att1,
    att2,
    '' AS tags,
    sum(value)
FROM test
GROUP BY
    att1,
    att2
    WITH CUBE
UNION ALL
SELECT
    '' AS att1,
    '' AS att2,
    arrayJoin(att3) AS tags,
    sum(value)
FROM test
GROUP BY tags

/*
┌─att1─┬─att2─┬─tags───────┬─sum(value)─┐
│      │      │ internet   │          6 │
│      │      │ sports     │         12 │
│      │      │ office     │          6 │
│      │      │ tag        │          4 │
│      │      │ planes     │          6 │
│      │      │ anothertag │          2 │
└──────┴──────┴────────────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ b    │ X    │      │         10 │
│ a    │ Z    │      │          2 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│ a    │      │      │          2 │
│ b    │      │      │         10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│      │ Z    │      │          2 │
│      │ X    │      │         10 │
└──────┴──────┴──────┴────────────┘
┌─att1─┬─att2─┬─tags─┬─sum(value)─┐
│      │      │      │         12 │
└──────┴──────┴──────┴────────────┘

*/

Upvotes: 0

Related Questions