Reputation: 305
I have this table with several columns containing dictionaries: payloadKV, metaKV, etc.
I need to unnest the dict and pivot the result to put each key in a column and the value in the correspondent cell of that row,column. The desired output of the screenshot above would be:
+---------------------+-------+---------------+----------------+---------------------+-----+
| ingestTimestamp | ... | metadata.Area | metadata.Cell | metadata.Department | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+
| 2022-03-23 02:34:41 | ... | MC | 0010 | 0752 | ... |
| ... | ... | ... | ... | ... | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+
Each of these dictionaries have an arbitrary number of key/values, which can be hundreds, and I cannot know the key names beforehand, so I need some generic expression to extract them.
I have seen examples how to extract the desired keys by hardcoding them, but I cannot seem to find a generic way to do it.
Upvotes: 3
Views: 3359
Reputation: 172974
Can be relatively easily done with BigQuery PIVOT
along with EXECUTE IMMEDIATE
as in below example
create temp table temp as (
select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
from your_table t, unnest(payloadKV)
union all
select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
from your_table t, unnest(metaKV)
);
execute immediate (select '''
select * from temp pivot (any_value(value) for key in (''' ||
(select string_agg("'" || key || "'", ',' order by key) from (select distinct key from temp))
|| '''))
''')
if applied to sample data as in (similar to) your question
select '2022-03-23 02:34:41' ingestTimestamp, [
struct('payload.Area' as key, 'MC1' as value), ('payload.Cell', '00101'), ('payload.Department', '07521')] payloadKV, [
struct('metadata.Area' as key, 'MC' as value),('metadata.Cell', '0010'), ('metadata.Department', '0752')] metaKV
union all
select '2022-03-24 02:34:41' ingestTimestamp, [
struct('payload.Area' as key, 'MC2' as value), ('payload.Cell', '00102'), ('payload.Department', '07522')] payloadKV, [
struct('metadata.Area' as key, 'MC3' as value),('metadata.Cell', '00103'), ('metadata.Department', '07523')] metaKV
output is
Upvotes: 3
Reputation: 1401
Its kind of challenging to get such a script. Its possible although you might need to do a lot of coding and try-error. If a table is really extensive you might want to use python ( as suggested by martin weitzmann
) to retrieve column information and create your script to get your data.
You can also use only BigQuery but you might find it difficult to implement on large tables but here is my approach, you can try this approach if it fits your scenario:
create or replace table`projectid.dataset.table`
(
id INT64,
ingestTimeStamp date,
payloadKV STRUCT<id INT64,json STRING>,
metaKV STRUCT<id INT64,description STRING>
)
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(1,"2022-03-03",(100,'{"kardexid":11,"desc":"d1"}'),(100,"a desc1"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(2,"2022-03-04",(101,'{"kardexid":22,"desc":"d2"}'),(110,"a desc2"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(3,"2022-03-05",(102,'{"kardexid":34,"desc":"d3"}'),(120,"a desc3"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(4,"2022-03-06",(103,'{"kardexid":53,"desc":"d4"}'),(130,"a desc4"));
declare working_table string;
declare loop_col String;
declare query String;
declare single_col_names String;
declare nested_col_array ARRAY<STRING>;
declare nested_col_string String DEFAULT "";
# Set columns to work
set working_table = "table";
set single_col_names = (SELECT STRING_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type not like 'STRUCT%');
set nested_col_array = (SELECT ARRAY_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type like 'STRUCT%');
# Retrieve nested columns
FOR record IN
(SELECT * FROM unnest(nested_col_array) as col_names)
DO
SET loop_col = (SELECT CONCAT(column_name,
".",
REPLACE(ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(data_type,r'[STRUCT<,INT64 STRING ]+(.+?) '),",")
,",",
CONCAT(",",column_name,".")))
FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type like 'STRUCT%' and column_name=record.col_names);
SET nested_col_string = (SELECT CONCAT(nested_col_string,",",loop_col));
END FOR;
# build & run query
set query = (SELECT FORMAT("select %s%s from `projectid.dataset.table` order by 1",single_col_names,nested_col_string));
EXECUTE IMMEDIATE(query);
output:
id | ingestTimeStamp | id_1 | json | id_2 | description |
---|---|---|---|---|---|
1 | 2022-03-03 | 100 | {"kardexid":11,"desc":"d1"} | 100 | a desc1 |
2 | 2022-03-04 | 101 | {"kardexid":22,"desc":"d2"} | 110 | a desc2 |
3 | 2022-03-05 | 102 | {"kardexid":34,"desc":"d3"} | 120 | a desc3 |
4 | 2022-03-06 | 103 | {"kardexid":53,"desc":"d4"} | 130 | a desc4 |
As you can see, a process such as this one on BigQuery is quite challenging as you will have to parse your struct types to get the names of your inner columns, do scripting and definitely not optimal. For the sake of the question, this can be done but it's not something I would recommend.
When dealing with BigQuery you usually want to go for less resource invested queries and just pick up what is truly needed. You can use client libraries to perform less operations on BigQuery side and use the code to perform transformations with the data you get from your raw queries.
To create this code I consult the following documentation, check it out:
Upvotes: 1
Reputation: 4736
In databases rows are observations and columns describe these observations. You want to non-consistently describe observations - that's a meta level of databases who are only meant to be consistent in this respect.
You can cross join with an unnested array and then pivot but "you" still need to know the column names in advance.
"You" is either you in person or a bit of code that prepares the SQL statement by gathering the information in advance - which can be an automated solution in python for instance. Basically
Upvotes: 0