Akhil Yadav
Akhil Yadav

Reputation: 97

Google BigQuery select sum of all the nested fields in a record

I have my schema similar to the below

|- name
+- cars
|  |- tesla integer
|  |- ferrari integer

Let's say every record specifies an order for cars and every order specifies the number of different cars that were ordered.

Now I want to query the table such that I get the sum of all the different car types. So I want something like SELECT SUM(cars.*) from table_name, because I don't know what all possible nested fields the cars might have.

The schema gets dynamically generated every time a new car model is added, and thus in my query I can't possibly specify all the field names because I won't be having them.

example data -

name    | cars.tesla | cars.ferrari
vendor1 |     12     |      10
vendor1 |     5      |      5
vendor2 |     4      |      3   

desired output for vendor1 -

name    | total_tesla | total_ferrari 
vendor1 |     17      |     15

So, I want to select sum of all the fields that are nested under the particular record. Is there a way I can do that?

Upvotes: 1

Views: 1788

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Below is for BigQuery Standard SQL

#standardSQL
SELECT name, SUM(SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) total
FROM `project.dataset.table` t,
UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(TO_JSON_STRING(t), '$.cars'), r'^{|}$', ''))) kv
GROUP BY name

You can test / play with above using dummy data as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' name, STRUCT<tesla INT64, ferrari INT64>(1, 2) cars UNION ALL
  SELECT 'b', STRUCT(3,4)
)
SELECT name, SUM(SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) total
FROM `project.dataset.table` t,
UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(TO_JSON_STRING(t), '$.cars'), r'^{|}$', ''))) kv
GROUP BY name

result is

name    total    
a       3    
b       7    

Below is simplified / refactored version of above:

#standardSQL
SELECT name, 
  (
    SELECT SUM(SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) 
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(cars), r'^{|}$', ''))) kv
  ) total
FROM `project.dataset.table`

And finally "final" version:

#standardSQL
CREATE TEMP FUNCTION SUM_NESTED(root STRING) AS (
  (SELECT SUM(SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) 
  FROM UNNEST(SPLIT(REGEXP_REPLACE(root, r'^{|}$', ''))) kv)
);

SELECT name, SUM_NESTED(TO_JSON_STRING(cars)) total
FROM `project.dataset.table`  

Update to address edited question

Below should give you direction - it gives you flatten result (you should search this site then for how to pivot it - there are plenty of questions/answers on that topic here)

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'vendor1' name, STRUCT<tesla INT64, ferrari INT64>(12, 10) cars UNION ALL
  SELECT 'vendor1', STRUCT(5, 5) UNION ALL
  SELECT 'vendor2', STRUCT(4,3) UNION ALL
  SELECT 'vendor2', STRUCT(1,NULL)
)
SELECT 
  name, 
  REPLACE(SPLIT(kv, ':')[OFFSET(0)], '"', '') car, 
  SUM(SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64)) total
FROM `project.dataset.table`,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(cars), r'^{|}$', ''))) kv
GROUP BY name, car
-- ORDER BY name, car  

result is

name        car         total    
vendor1     ferrari     15   
vendor1     tesla       17   
vendor2     ferrari     3    
vendor2     tesla       5    

Upvotes: 3

Related Questions