Reputation: 85
A partner push a table to our BigQuery DW every day, with the following structure (dummy names):
table names: entry_2020_06_16, entry_2020_06_17, entry_2020_06_18 ... entry_2022_12_31
schema: col_1, col_2, col_3... , col_??
The important thing here is that the number of columns varies in the daily files.
In table entry_2020_06_16 it might go from col_1, col_2, ... col_149.
In table entry_2020_06_17 it might go from col_1, col_2, ... col_23.
I would like to union these files such that no data is lost and missing values are set to null. Furthermore, it needs to be dynamic and should run within the BigQuery environment.
Unfortunately, the "from entry_*" trick does not work as it drops columns not common to all tables.
In the example below, i would like to end up with a table looking like:
columns: col_1, col_2, col_3
row 1: x, y, z
row 2: a, b, null
create or replace table temp.entry_2020_06_10 as select "x" as col_1, "y" as col_2, "z" as col_3
create or replace table temp.entry_2020_06_11 as select "a" as col_1, "b" as col_2
select * from `temp.entry_20*`
Any creative ideas? :)
Upvotes: 1
Views: 1033
Reputation: 314
you can use the INFORMATION_SCHEMA.COLUMNS with execute immediate to generate a dynamic query:
execute immediate (SELECT
string_agg(CONCAT('select ',columns,' from `project-id.dataset-id.',table_name,'`'),'\nUnion all\n')
FROM (
SELECT
table_column.table_name,
STRING_AGG(CASE
WHEN x.column_name IS NULL THEN CONCAT('null as ',table_column.column_name)
ELSE
table_column.column_name
END
,', '
ORDER BY
table_column.column_name) columns
FROM (
SELECT
tables.table_name,
columns.column_name
FROM (
SELECT
DISTINCT column_name
FROM
`project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` a where table_name like 'entry_%') columns
FULL OUTER JOIN (
SELECT
DISTINCT table_name
FROM
`project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` b where table_name like 'entry_%') tables
ON
(1=1)) table_column
LEFT JOIN
`project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` x
ON
(x.table_name = table_column.table_name
AND x.column_name = table_column.column_name)
GROUP BY
1 )) ;
Upvotes: 3