Dan
Dan

Reputation: 1986

How can I transpose and sum a group of columns efficiently in SQL?

Let's say I have some data that looks like this:

+--------+-----------+-------------+----------+------------+
| person | red_apple | green_apple | red_pear | green_pear |
+--------+-----------+-------------+----------+------------+
| bill   | 4         | 1           | 1        | 4          |
| bill   | 0         | 1           | 2        | 0          |
| jill   | 2         | 1           | 1        | 4          |
| jill   | 0         | 0           | 2        | 0          |
+--------+-----------+-------------+----------+------------+

And I would like to transpose and sum the data to produce output that looks like this:

+--------+-------+-------+-------+
| person | color | fruit | value |
+--------+-------+-------+-------+
| bill   | red   | apple | 4     |
| jill   | red   | apple | 2     |
| bill   | green | apple | 2     |
| jill   | green | apple | 1     |
| bill   | red   | pear  | 3     |
| jill   | red   | pear  | 3     |
| bill   | green | pear  | 4     |
| jill   | green | pear  | 4     |
+--------+-------+-------+-------+

How can I do such a thing? Let's say that in practice, there are actually a lot more fruits, colors, and people than I have listed for this simple example. I am trying to solve two problems: 1) making the SQL not a hairy beast to read and maintain, and 2) making the query efficient, as the actual underlying table is large.

The best I have so far is something like this, using macros:

DEFINE MACRO GROUP_SUM (
  SELECT
    person,
    "$1" as color,
    "$2" as fruit,
    SUM($1_$2) as value
  FROM source_table
  GROUP BY 1, 2
);

DEFINE MACRO ALL_COLORS $GROUP_SUM(red,$1) UNION ALL $GROUP_SUM(green,$1);

SELECT
  person,
  color,
  fruit,
  SUM(value) as value
FROM
  ($ALL_COLORS(apple)
   UNION ALL $ALL_COLORS(pear))
GROUP BY 1, 2, 3;

It's okay from a maintainability perspective, but when we get out to 5 or 10 colors and fruits, I assume based on the amount of time it's taking to run the query and my assumptions about how sub-queries work that we end up re-scanning the table a number of times.

Is there a way to improve this query's efficiency?

Upvotes: 0

Views: 192

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173191

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  person,
  REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'"(.+)_') AS color, 
  REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'_(.+)"') AS fruit, 
  SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) AS value 
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r',(".+?":.+?)')) kv
GROUP BY person, color, fruit   

if to apply to sample data in your question - result is

enter image description here

You can test with CTE built off of your sample data as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'bill' person, 4 red_apple, 1 green_apple, 1 red_pear, 4 green_pear UNION ALL
  SELECT 'bill', 0, 1, 2, 0 UNION ALL
  SELECT 'jill', 2, 1, 1, 4 UNION ALL
  SELECT 'jill', 0, 0, 2, 0 
)
SELECT 
  person,
  REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'"(.+)_') AS color, 
  REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'_(.+)"') AS fruit, 
  SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) AS value 
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r',(".+?":.+?)')) kv
GROUP BY person, color, fruit

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270993

Try disaggregating and aggregating once:

select person, color, fruit, sum(value)
from ((select person, 'red' as color, 'apple' as fruit, red_apple as value
       from t
      ) union all
      (select person, 'green' as color, 'apple' as fruit, green_apple
       from t
      ) union all
      (select person, 'red' as color, 'pear' as fruit, red_pear
       from t
      ) union all
      (select person, 'green' as color, 'pear' as fruit, green_pear
       from t
      )
     ) ap
group by person, color, fruit;

Depending on your data, this should be faster than three separate aggregations.

There may be faster methods, depending on your database.

Upvotes: 1

Related Questions