Reputation: 1986
Let's say I have some data that looks like this:
+--------+-----------+-------------+----------+------------+
| person | red_apple | green_apple | red_pear | green_pear |
+--------+-----------+-------------+----------+------------+
| bill | 4 | 1 | 1 | 4 |
| bill | 0 | 1 | 2 | 0 |
| jill | 2 | 1 | 1 | 4 |
| jill | 0 | 0 | 2 | 0 |
+--------+-----------+-------------+----------+------------+
And I would like to transpose and sum the data to produce output that looks like this:
+--------+-------+-------+-------+
| person | color | fruit | value |
+--------+-------+-------+-------+
| bill | red | apple | 4 |
| jill | red | apple | 2 |
| bill | green | apple | 2 |
| jill | green | apple | 1 |
| bill | red | pear | 3 |
| jill | red | pear | 3 |
| bill | green | pear | 4 |
| jill | green | pear | 4 |
+--------+-------+-------+-------+
How can I do such a thing? Let's say that in practice, there are actually a lot more fruits, colors, and people than I have listed for this simple example. I am trying to solve two problems: 1) making the SQL not a hairy beast to read and maintain, and 2) making the query efficient, as the actual underlying table is large.
The best I have so far is something like this, using macros:
DEFINE MACRO GROUP_SUM (
SELECT
person,
"$1" as color,
"$2" as fruit,
SUM($1_$2) as value
FROM source_table
GROUP BY 1, 2
);
DEFINE MACRO ALL_COLORS $GROUP_SUM(red,$1) UNION ALL $GROUP_SUM(green,$1);
SELECT
person,
color,
fruit,
SUM(value) as value
FROM
($ALL_COLORS(apple)
UNION ALL $ALL_COLORS(pear))
GROUP BY 1, 2, 3;
It's okay from a maintainability perspective, but when we get out to 5 or 10 colors and fruits, I assume based on the amount of time it's taking to run the query and my assumptions about how sub-queries work that we end up re-scanning the table a number of times.
Is there a way to improve this query's efficiency?
Upvotes: 0
Views: 192
Reputation: 173191
Below is for BigQuery Standard SQL
#standardSQL
SELECT
person,
REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'"(.+)_') AS color,
REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'_(.+)"') AS fruit,
SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) AS value
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r',(".+?":.+?)')) kv
GROUP BY person, color, fruit
if to apply to sample data in your question - result is
You can test with CTE built off of your sample data as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'bill' person, 4 red_apple, 1 green_apple, 1 red_pear, 4 green_pear UNION ALL
SELECT 'bill', 0, 1, 2, 0 UNION ALL
SELECT 'jill', 2, 1, 1, 4 UNION ALL
SELECT 'jill', 0, 0, 2, 0
)
SELECT
person,
REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'"(.+)_') AS color,
REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'_(.+)"') AS fruit,
SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) AS value
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r',(".+?":.+?)')) kv
GROUP BY person, color, fruit
Upvotes: 0
Reputation: 1270993
Try disaggregating and aggregating once:
select person, color, fruit, sum(value)
from ((select person, 'red' as color, 'apple' as fruit, red_apple as value
from t
) union all
(select person, 'green' as color, 'apple' as fruit, green_apple
from t
) union all
(select person, 'red' as color, 'pear' as fruit, red_pear
from t
) union all
(select person, 'green' as color, 'pear' as fruit, green_pear
from t
)
) ap
group by person, color, fruit;
Depending on your data, this should be faster than three separate aggregations.
There may be faster methods, depending on your database.
Upvotes: 1