lookbadgers
lookbadgers

Reputation: 988

Wildcard tables and differences in table columns

I am try to use BigQuery to JOIN over a range of tables using Wildcard Tables.

The query works when all tables matched by the wildcard have the column fooid (bar*.fooid). However this column is a recent addition and when the table wildcard matches tables where the field does not exist the query fails.

Error: Cannot read non-required field 'fooid' as required INT64. 

This is a simplified version of the query to demonstrate the issue, it would be selecting more columns from both foo and bar.

SELECT foo.foo_id AS foo
FROM `bar.bar*` AS bar_alias
LEFT JOIN bar.foo_map foo ON (bar_alias.fooid = foo.foo_id)
WHERE (_TABLE_SUFFIX BETWEEN  '20170206' AND '20170208')

I've looked at a number of answers including BigQuery IF field exists THEN, but can't see how to use them in conjunction with a JOIN or when the tables without the column are not known.

Upvotes: 0

Views: 2485

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

Here is an example of how this can arise, and how to fix it by using a reference schema from an empty table where the column/field is NULLABLE. Suppose I have the following two tables:

$ bq query --use_legacy_sql=false \
  "CREATE TABLE tmp_elliottb.bar20180328 (y STRING) AS SELECT 'bar';"

$ bq query --use_legacy_sql=false \
  "CREATE TABLE tmp_elliottb.bar20180329 " \
  "(x INT64 NOT NULL, y STRING) AS SELECT 1, 'foo';"

Column x has the NOT NULL attribute in the second table, but the column is missing from the first table. I get an error when I try to use a table wildcard:

$ bq query --use_legacy_sql=false \
  "SELECT * FROM \`tmp_elliottb.bar*\` " \
  "WHERE _TABLE_SUFFIX BETWEEN '20180301' AND '20180329';"
Waiting on <job id> ... (0s) Current status: DONE   
Error in query string: Error processing job '<job id>': Cannot read non-required field 'x' as required INT64.
Failure details:
- query: Cannot read non-required field 'x' as required INT64.

This makes sense--I said that x is NOT NULL, but the bar20180328 table doesn't have the column. Now if I create a new table that matches the * expansion, but where the column doesn't have NOT NULL:

$ bq query --use_legacy_sql=false \
  "CREATE TABLE tmp_elliottb.bar_empty (x INT64, y STRING);"
$ bq query --use_legacy_sql=false \
  "SELECT * FROM \`tmp_elliottb.bar*\` " \
  "WHERE _TABLE_SUFFIX BETWEEN '20180301' AND '20180329';"
...
+------+-----+
|  x   |  y  |
+------+-----+
|    1 | foo |
| NULL | bar |
+------+-----+

I get results instead of an error. In your case, you need to create a table with the expected schema called bar_empty, for example, but where none of the fields/columns that are missing for other tables have a NOT NULL attribute.

With that said, I would strongly recommend using a partitioned table instead, if possible. Among other benefits, partitioned tables are much nicer to work with because they have a consistent schema across all days.

Upvotes: 2

Related Questions