Union many tables with different schemas in BigQuery

Question

A partner push a table to our BigQuery DW every day, with the following structure (dummy names):

table names: entry_2020_06_16, entry_2020_06_17, entry_2020_06_18 ... entry_2022_12_31
schema: col_1, col_2, col_3... , col_??

The important thing here is that the number of columns varies in the daily files.
In table entry_2020_06_16 it might go from col_1, col_2, ... col_149.
In table entry_2020_06_17 it might go from col_1, col_2, ... col_23.

I would like to union these files such that no data is lost and missing values are set to null. Furthermore, it needs to be dynamic and should run within the BigQuery environment.

Unfortunately, the "from entry_*" trick does not work as it drops columns not common to all tables.

In the example below, i would like to end up with a table looking like:
columns: col_1, col_2, col_3
row 1: x, y, z
row 2: a, b, null

create or replace table temp.entry_2020_06_10 as select "x" as col_1, "y" as col_2, "z" as col_3
create or replace table temp.entry_2020_06_11 as select "a" as col_1, "b" as col_2
select * from `temp.entry_20*`

Any creative ideas? :)

Joe · Accepted Answer

you can use the INFORMATION_SCHEMA.COLUMNS with execute immediate to generate a dynamic query:

execute immediate  (SELECT
  string_agg(CONCAT('select ',columns,' from `project-id.dataset-id.',table_name,'`'),'
Union all
')
FROM (
  SELECT
    table_column.table_name,
    STRING_AGG(CASE
        WHEN x.column_name IS NULL THEN CONCAT('null as ',table_column.column_name)
      ELSE
      table_column.column_name
    END
      ,', '
    ORDER BY
      table_column.column_name) columns
  FROM (
    SELECT
      tables.table_name,
      columns.column_name
    FROM (
      SELECT
        DISTINCT column_name
      FROM
        `project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` a where table_name like 'entry_%') columns
    FULL OUTER JOIN (
      SELECT
        DISTINCT table_name
      FROM
        `project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` b where table_name like 'entry_%') tables
    ON
      (1=1)) table_column
  LEFT JOIN
    `project-id.dataset-id.INFORMATION_SCHEMA.COLUMNS` x
  ON
    (x.table_name = table_column.table_name
      AND x.column_name = table_column.column_name)
  GROUP BY
    1 )) ;

Union many tables with different schemas in BigQuery

Answers (1)

Related Questions