Iván Sánchez

Reputation: 305

Google BigQuery: UNNEST where each different key becomes a column

I have this table with several columns containing dictionaries: payloadKV, metaKV, etc.

[screenshot of the example table]

I need to unnest the dicts and pivot the result so that each key becomes a column and each value goes in the corresponding cell for that row and column. The desired output for the screenshot above would be:

+---------------------+-------+---------------+----------------+---------------------+-----+
|   ingestTimestamp   |  ...  | metadata.Area |  metadata.Cell | metadata.Department | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+
| 2022-03-23 02:34:41 |   ... | MC            |           0010 |                0752 | ... |
| ...                 |   ... | ...           |            ... |                 ... | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+

Each of these dictionaries has an arbitrary number of key/value pairs, potentially hundreds, and I cannot know the key names beforehand, so I need a generic expression to extract them.

I have seen examples of how to extract the desired keys by hardcoding them, but I cannot seem to find a generic way to do it.
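The hardcoded approach looks roughly like this (a minimal sketch, assuming the KV columns are ARRAY<STRUCT<key STRING, value STRING>>; one subquery per known key, which clearly does not scale to hundreds of unknown keys):

select
  ingestTimestamp,
  -- one hardcoded subquery per key
  (select value from unnest(metaKV) where key = 'metadata.Area') as metadata_Area,
  (select value from unnest(metaKV) where key = 'metadata.Cell') as metadata_Cell
from your_table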

Upvotes: 3

Views: 3359

Answers (3)

Mikhail Berlyant

Reputation: 172974

This can be done relatively easily with BigQuery PIVOT combined with EXECUTE IMMEDIATE, as in the example below:

-- Flatten both KV arrays into key/value rows; '.' is replaced with '_'
-- so that each key can be used as a column name by PIVOT.
create temp table temp as (
  select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
  from your_table t, unnest(payloadKV)
  union all
  select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
  from your_table t, unnest(metaKV)
);

-- Build the PIVOT statement with the distinct keys, then run it.
execute immediate (select '''
  select * from temp pivot (any_value(value) for key in (''' ||
  (select string_agg("'" || key || "'", ',' order by key) from (select distinct key from temp))
  || '''))
''')

if applied to sample data similar to that in your question

select '2022-03-23 02:34:41' ingestTimestamp, [
    struct('payload.Area' as key, 'MC1' as value), ('payload.Cell', '00101'), ('payload.Department', '07521')] payloadKV, [
    struct('metadata.Area' as key, 'MC' as value),('metadata.Cell', '0010'), ('metadata.Department', '0752')] metaKV
union all
select '2022-03-24 02:34:41' ingestTimestamp, [
    struct('payload.Area' as key, 'MC2' as value), ('payload.Cell', '00102'), ('payload.Department', '07522')] payloadKV, [
    struct('metadata.Area' as key, 'MC3' as value),('metadata.Cell', '00103'), ('metadata.Department', '07523')] metaKV
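For this sample data the generated pivot statement comes out as below (shown here only to illustrate what the string_agg assembles; the key list is built dynamically):

select * from temp pivot (any_value(value) for key in ('metadata_Area','metadata_Cell','metadata_Department','payload_Area','payload_Cell','payload_Department'))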

output is

+---------------------+---------------+---------------+---------------------+--------------+--------------+--------------------+
|   ingestTimestamp   | metadata_Area | metadata_Cell | metadata_Department | payload_Area | payload_Cell | payload_Department |
+---------------------+---------------+---------------+---------------------+--------------+--------------+--------------------+
| 2022-03-23 02:34:41 | MC            | 0010          | 0752                | MC1          | 00101        | 07521              |
| 2022-03-24 02:34:41 | MC3           | 00103         | 07523               | MC2          | 00102        | 07522              |
+---------------------+---------------+---------------+---------------------+--------------+--------------+--------------------+

Upvotes: 3

Betjens

Reputation: 1401

It's kind of challenging to get such a script working. It's possible, but you may need a lot of coding and trial and error. If the table is really extensive you might want to use Python (as suggested by Martin Weitzmann) to retrieve column information and generate the script that gets your data.

You can also use only BigQuery, though you might find it difficult to implement on large tables. Here is my approach; try it if it fits your scenario:

  1. Create our test table with some records
create or replace table `projectid.dataset.table`
(
    id INT64,
    ingestTimeStamp DATE,
    payloadKV STRUCT<id INT64, json STRING>,
    metaKV STRUCT<id INT64, description STRING>
);

insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(1,"2022-03-03",(100,'{"kardexid":11,"desc":"d1"}'),(100,"a desc1"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(2,"2022-03-04",(101,'{"kardexid":22,"desc":"d2"}'),(110,"a desc2"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(3,"2022-03-05",(102,'{"kardexid":34,"desc":"d3"}'),(120,"a desc3"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(4,"2022-03-06",(103,'{"kardexid":53,"desc":"d4"}'),(130,"a desc4"));
  2. Let's declare our working variables
declare working_table STRING;
declare loop_col STRING;
declare query STRING;
declare single_col_names STRING;
declare nested_col_array ARRAY<STRING>;
declare nested_col_string STRING DEFAULT "";
  3. Set our working variables
# Set columns to work
set working_table = "table";

set single_col_names = (SELECT STRING_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type not like 'STRUCT%');

set nested_col_array = (SELECT ARRAY_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type like 'STRUCT%');
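For the test table above these two statements end up setting single_col_names to "id,ingestTimeStamp" and nested_col_array to ["payloadKV", "metaKV"] (modulo INFORMATION_SCHEMA row order, which is not guaranteed).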
  4. Get our nested columns
# Retrieve nested columns
FOR record IN
  (SELECT * FROM unnest(nested_col_array) as col_names)
DO
    # Parse the STRUCT's field names out of its data_type string
    # (e.g. "STRUCT<id INT64, json STRING>") and prefix each field with
    # the column name, producing e.g. "payloadKV.id,payloadKV.json".
    SET loop_col = (SELECT CONCAT(column_name,
                                    ".",
                                    REPLACE(ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(data_type,r'[STRUCT<,INT64 STRING ]+(.+?) '),",")
                                            ,",",
                                            CONCAT(",",column_name,".")))
    FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
    where table_name = working_table and data_type like 'STRUCT%' and column_name=record.col_names);

    SET nested_col_string = (SELECT CONCAT(nested_col_string,",",loop_col));
END FOR;
  5. We then finalize by building our custom query and running it.
# build & run query
set query = (SELECT FORMAT("select %s%s from `projectid.dataset.table` order by 1",single_col_names,nested_col_string));
EXECUTE IMMEDIATE(query);
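For the test table above, the query that gets built and executed comes out roughly as follows (a sketch; single_col_names and nested_col_string are assembled by the script, and INFORMATION_SCHEMA row order is not guaranteed):

select id,ingestTimeStamp,payloadKV.id,payloadKV.json,metaKV.id,metaKV.description from `projectid.dataset.table` order by 1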

output:

+----+-----------------+------+-----------------------------+------+-------------+
| id | ingestTimeStamp | id_1 | json                        | id_2 | description |
+----+-----------------+------+-----------------------------+------+-------------+
|  1 | 2022-03-03      |  100 | {"kardexid":11,"desc":"d1"} |  100 | a desc1     |
|  2 | 2022-03-04      |  101 | {"kardexid":22,"desc":"d2"} |  110 | a desc2     |
|  3 | 2022-03-05      |  102 | {"kardexid":34,"desc":"d3"} |  120 | a desc3     |
|  4 | 2022-03-06      |  103 | {"kardexid":53,"desc":"d4"} |  130 | a desc4     |
+----+-----------------+------+-----------------------------+------+-------------+

As you can see, a process such as this one is quite challenging in BigQuery: you have to parse the struct types to get the names of your inner columns and resort to scripting, which is definitely not optimal. For the sake of the question, it can be done, but it's not something I would recommend.

When dealing with BigQuery you usually want less resource-intensive queries that pick up only what is truly needed. You can use the client libraries to perform fewer operations on the BigQuery side and do the transformations in code, using the data you get from your raw queries.

To create this code I consulted the BigQuery documentation on scripting and on INFORMATION_SCHEMA.COLUMNS.

Upvotes: 1

Martin Weitzmann

Reputation: 4736

In databases, rows are observations and columns describe these observations. You want to describe observations inconsistently - that's a meta level of databases, which are only meant to be consistent in this respect.

You can cross join with the unnested array and then pivot, but "you" still need to know the column names in advance.

"You" is either you in person or a bit of code that prepares the SQL statement by gathering the information in advance - which can be an automated solution in python for instance. Basically

  1. Gather pivot column information using Python + BigQuery: flatten the array and get the distinct metadata.key values (sketched below)
  2. In Python, prepare a SQL statement with a customized pivot clause using the metadata.key information from step 1
  3. Run that statement
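A minimal sketch of the step 1 query that the Python code would run (assuming metaKV is an ARRAY<STRUCT<key STRING, value STRING>> column as in the question):

-- gather the distinct keys to pivot on
select distinct key
from your_table, unnest(metaKV)
order by key

The code then splices each returned key into the IN (...) list of a PIVOT statement (as in the other answers) and executes the result.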

Upvotes: 0
