Reputation: 55
I am looking for a way to aggregate a BigQuery table with about 100 columns using the SUM() function, but the following query is not allowed in BigQuery standard SQL:
select sum(*)
from `dataset.Intermediate_Tables.eventCat_dummies`
group by Id
The reason I want to do this kind of aggregation in BigQuery is its ability to handle large amounts of data. I tried to do the same aggregation in a Jupyter notebook, but it failed every time, probably because of the size of the data (a 7.3 GiB CSV file). The code I tried was:
df_type = type_dummies.groupby('Id', sort=False).sum()
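For reference, a memory-friendlier variant of that groupby is to aggregate in chunks and then re-sum the partial results (a sketch using the sample data below in place of the real 7.3 GiB file; the chunk size is illustrative):

```python
import io
import pandas as pd

# Sample data from the question; in practice this would be the CSV on disk.
csv_data = io.StringIO(
    "Id,col1,col2,col3,col4\n"
    "1,0,0,0,1\n"
    "2,0,1,1,1\n"
    "1,1,0,0,0\n"
    "4,0,0,0,0\n"
    "19,0,0,0,0\n"
    "2,1,1,1,1\n"
)

# Aggregate in chunks so the full file never has to sit in memory at once.
# chunksize=2 is tiny for demonstration; use e.g. 1_000_000 for real data.
partials = [chunk.groupby('Id', sort=False).sum()
            for chunk in pd.read_csv(csv_data, chunksize=2)]

# Re-summing the partial group sums gives the same result as a single
# groupby, because SUM is associative across chunks.
df_type = pd.concat(partials).groupby(level='Id', sort=False).sum()
print(df_type)
```

This works because a sum of sums equals the overall sum, so each chunk's partial aggregate can be combined cheaply at the end.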
Can anyone suggest how I can aggregate this large dataset, in BigQuery or otherwise?
UPDATE WITH SAMPLE INPUT AND OUTPUT
Input data
Id  col1  col2  col3  col4
1   0     0     0     1
2   0     1     1     1
1   1     0     0     0
4   0     0     0     0
19  0     0     0     0
2   1     1     1     1
Desired output
Id  col1_sum  col2_sum  col3_sum  col4_sum
1   1         0         0         1
2   1         2         2         2
4   0         0         0         0
19  0         0         0         0
In my original dataset, there are 100 columns and 40 million rows.
Upvotes: 1
Views: 3691
Reputation: 173191
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
SELECT 1, 4, 5, 6 UNION ALL
SELECT 2, 7, 8, 9
)
SELECT id,
  SUM((
    SELECT SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64))
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair
    WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
  )) val
FROM `project.dataset.table` t
GROUP BY id
with the result:
Row  id  val
1    1   21
2    2   24
As you can see, the above has no dependency on the number of columns. It assumes every column except id needs to be summed; if you have more columns to exclude, adjust the WHERE clause accordingly.
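The mechanics of the TO_JSON_STRING trick can be mimicked outside BigQuery; here is a rough, purely illustrative Python sketch of what the query does, using the same three sample rows: serialize each row to a JSON string, split it into key:value pairs, drop the "id" pair, sum the rest, then group by id.

```python
import json

# Same sample rows as in the WITH clause above.
rows = [
    {"id": 1, "a": 1, "b": 2, "c": 3},
    {"id": 1, "a": 4, "b": 5, "c": 6},
    {"id": 2, "a": 7, "b": 8, "c": 9},
]

# Per row: serialize to JSON (TO_JSON_STRING), split into key:value pairs,
# drop the "id" pair, and sum the remaining values.
def row_sum(row):
    pairs = json.dumps(row, separators=(',', ':')).strip('{}').split(',')
    return sum(int(p.split(':')[1]) for p in pairs
               if p.split(':')[0] != '"id"')

# GROUP BY id, SUM the per-row totals.
totals = {}
for row in rows:
    totals[row["id"]] = totals.get(row["id"], 0) + row_sum(row)

print(totals)
```

The per-row totals are 6, 15, and 24, which grouped by id give 21 for id 1 and 24 for id 2, matching the query result above.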
Update based on provided details
So you want to sum each individual column separately (initially I read your question as wanting to sum all the columns' values together by id).
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
SELECT 1, 4, 5, 6 UNION ALL
SELECT 2, 7, 8, 9
)
SELECT id,
  REPLACE(SPLIT(pair, ':')[SAFE_OFFSET(0)], '"', '') col,
  SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64)) val
FROM (
  SELECT id,
    ARRAY(
      SELECT pair
      FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t1), r'[{}]', ''))) pair
      WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
    ) arr
  FROM `project.dataset.table` t1
) t2,
UNNEST(arr) pair
GROUP BY id, col
ORDER BY id, col
This will give you the result below:
Row  id  col  val
1    1   a    5
2    1   b    7
3    1   c    9
4    2   a    7
5    2   b    8
6    2   c    9
This result is a flattened version of what you need, and in most practical use cases it is much more efficient than the pivoted one.
Still, if you want to pivot this result, see https://stackoverflow.com/a/35808045/5221944
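As an alternative to pivoting in SQL, if you pull the flattened result into pandas the pivot is a one-liner (a sketch using the sample output above):

```python
import pandas as pd

# Flattened result from the query above (id, col, val).
flat = pd.DataFrame({
    'id':  [1, 1, 1, 2, 2, 2],
    'col': ['a', 'b', 'c', 'a', 'b', 'c'],
    'val': [5, 7, 9, 7, 8, 9],
})

# Pivot back to one row per id with one column per original column.
pivoted = flat.pivot(index='id', columns='col', values='val')
print(pivoted)
```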
Upvotes: 4