user_1357
user_1357

Reputation: 7940

BigQuery wildcard schema inference clears valid STRUCT

Given tables, table_1 and table_2below.

CREATE TABLE dataset.table_1(
  timestamp_ TIMESTAMP,
  dimension1 STRUCT< dimension2 STRUCT< field1>>>
)

CREATE TABLE dataset.table_2(
  timestamp_ TIMESTAMP,
  dimension1 STRUCT< dimension2 STRUCT< dimension3 STRUCT<field1>>>
)

When I perform following wildcard search below, returned schema completely replaces schema from table_1 with table_2 for all structs. If fields are not structs, all new fields are appended. Is this an expected behavior? Should't schema inference happen within matched tables via suffix as well rather than just *?

SELECT *
FROM `project.dataset.table_*` 
WHERE _TABLE_SUFFIX = '1'

Upvotes: 0

Views: 217

Answers (2)

Ping
Ping

Reputation: 1

Yes, this is the expected behaviour, though it can be very surprising if you are used to SQL UNION!

For comparison, when you UNION tables:

  • The result has exactly the schema of the first table in the expression.
  • The column names of all the subsequent tables are ignored.
  • The columns are matched up by position (and hence types must match by position).

When you combine tables with a wildcard:

  • The result has exactly the schema of the last created table that matches the wildcard.
  • The columns are matched up by name (and hence types must match by name).
  • Any missing columns are filled in with nulls.

This also applies to STRUCT columns in the tables. So if you UNION two tables with different STRUCTs in the same column (when matched by position), the STRUCT field names will be taken from the first table, the STRUCT fields will be aligned by position, and the field names in the rest of the tables will be ignored. If you wildcard-select two tables with different STRUCTs in the same column (when matched by name), the resulting STRUCT fields will come from the last created table, the STRUCT fields in the rest of the tables will be matched up by field name, and the missing fields will be filled in with nulls.

This implies several counterintuitive effects:

  • Wildcard and UNION are not equivalent. If you have two tables named foo_1 and foo_2, then select * from foo_* can produce different results from select * from foo_1 union all select * from foo_2.

  • Table creation dates matter. Normally you might assume that deleting a table and re-creating it with exactly the same name and contents wouldn't change anything. But if you create tables foo_a and then foo_b and query for a wildcard foo_*, deleting and re-creating foo_a can change the result.

  • Schema can spontaneously change. If you are doing select * from foo_* where foo_* is a date-sharded table (i.e. a group of tables named foo_20240101, foo_20240102, etc.), columns can appear and disappear day by day, as new daily tables are added! A query like select a, b from foo_* can succeed one day and fail the next.

  • Schema can depend on an unused table. The schema comes from the last created table that matches the wildcard, even if that table is entirely unused. For example, select * from foo_* where _table_suffix = 'bar' is not equivalent to select * from foo_bar; the former can change when a new table named foo_abc is created, but the latter cannot.

  • Type errors can be masked. If tables in the wildcard have incompatible data types, there will be a type error only if an incompatible column appears in the schema determined by the last table. For example, if foo_a has columns (x STRING, y INT) and foo_b has columns (x STRING, y STRING), then select * from foo_* will fail with a type error. But if an additional table foo_c is created with columns (x STRING, z INT), this will magically cause select * from foo_* to succeed! But, of course, if you delete and re-create foo_a with the same name and contents, it will fail again.

This has particularly confusing effects when Cloud Logging is involved. A common use of wildcards is to combine all the daily tables generated by a Cloud Logging sink. Cloud Logging generates date-sharded tables, one per day. Each table is initially created with the schema of the first log entry of the day, and more fields can be added over the course of the day as new fields appear (such as fields within the STRUCT column "labels") in subsequent log entries. So the "schema of the last created table" will be constantly changing!

A wildcard select over Cloud Logging tables can have fewer fields in the morning, then more fields at night, fewer fields the next morning, more fields the next night, and so on, with historical data fields appearing and disappearing periodically every day. A view defined over these tables can therefore fail in the morning, succeed in the evening, fail the next morning, etc. or just fail all day or succeed all day depending what got logged that day.

Upvotes: 0

Elliott Brossard
Elliott Brossard

Reputation: 33705

From the documentation:

BigQuery uses the schema for the most recently created table that matches the wildcard as the schema for the wildcard table.

Said differently, the reference schema comes from the most recent table that matches the wildcard, and the filter on the _TABLE_SUFFIX pseudo column is irrelevant in determining the schema.

Upvotes: 1

Related Questions