Reputation: 664
Is it possible to create a repeated record column in BigQuery? For example, for the following data:
| a | b | c |
-------------
| 1 | 5 | 2 |
-------------
| 1 | 3 | 1 |
-------------
| 2 | 2 | 1 |
Is the following possible?
Select a, NEST(b, c) as d from *table* group by a
To produce the following results
| a | d.b | d.c |
-----------------
| 1 | 5 | 2 |
-----------------
| | 3 | 1 |
-----------------
| 2 | 2 | 1 |
Upvotes: 11
Views: 11993
Reputation: 172993
With introduction of BigQuery Standard SQL we've got easy way to deal with records
Try below, Don't forget to uncheck Use Legacy SQL
checkbox under Show Options
WITH YourTable AS (
SELECT 1 AS a, 5 AS b, 2 AS c UNION ALL
SELECT 1 AS a, 3 AS b, 1 AS c UNION ALL
SELECT 2 AS a, 2 AS b, 1 AS c
)
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS d
FROM YourTable
GROUP BY a
Upvotes: 12
Reputation: 172993
One of the way to go around NEST()
limitation of "nesting" just one field is to use BigQuery User-Defined Functions. They are extremely powerful yet still have some Limits and Limitations to be aware of. And most important from my prospective to have in mind - they are quite a candidates for being qualified as expensive High-Compute queries
Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.
So, below is example that "mimic" NEST(b, c) from example in questino:
SELECT a, d.b, d.c FROM
JS(( // input table
SELECT a, NEST(CONCAT(STRING(b), ',', STRING(c))) AS d
FROM (
SELECT * FROM
(SELECT 1 AS a, 5 AS b, 2 AS c),
(SELECT 1 AS a, 3 AS b, 1 AS c),
(SELECT 2 AS a, 2 AS b, 1 AS c)
) GROUP BY a),
a, d, // input columns
"[{'name': 'a', 'type': 'INTEGER'}, // output schema
{'name': 'd', 'type': 'RECORD',
'mode': 'REPEATED',
'fields': [
{'name': 'b', 'type': 'STRING'},
{'name': 'c', 'type': 'STRING'}
]
}
]",
"function(row, emit){ // function
var c = [];
for (var i = 0; i < row.d.length; i++) {
x = row.d[i].toString().split(',');
t = {b:x[0], c:x[1]}
c.push(t);
};
emit({a: row.a, d: c});
}"
)
It is relatively straightforward. I hope you will be able to walk through it and get an idea
Still - remember:
No matter how you create record with nested/repeated fields - BigQuery automatically flattens query results, so visible results won't contain repeated fields. So you should use it as a subselect that produces intermediate results for immediate use by the same query.
As FYI, you can prove for yourself that above returns only two records (not three as it is looks like when it is flattened) by running below query
SELECT COUNT(1) AS rows FROM (
<above query here>
)
Another important NOTE:
This is a known that NEST()
is not compatible with UnFlatten Results
Output and mostly is used for intermediate result in subquery.
In contrast, above solution can be easily saved directly to table (with unchecked Flatten Results)
Upvotes: 6
Reputation: 172993
BigQuery automatically flattens query results, so if you use the NEST function on the top level query, the results won't contain repeated fields. Use the NEST function when using a subselect that produces intermediate results for immediate use by the same query.
See more about NEST() at https://cloud.google.com/bigquery/query-reference#aggfunctions
Also check Big Query - Group By Clause not working with NEST() as FYI
Another point to have in mind - yo can NEST only one field - NEST(b)
but not NEST(b, c)
That said - you can produce result similar to what you asked but you need to write it to the table
From my experience: users more facing this questions while loading data into BigQuery - that is were user can use nlJSON with as complex schema as needed. In GBQ itself user usually more interested in analysis , aggregation so above type questions are rised less frequently. I think current GBQ sysntax is not friendly/flexible enough to generate data of "complex" hierarchical/nested schema and insert it into table all in GBQ only. Still, I think workarounds are possible but depends on specific use case
Upvotes: 0