Minato
Minato

Reputation: 462

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery whose schema is auto-generated from the file in GCS. I'm using Apache Airflow to do the same, the problem I'm having is that when I use auto-detect schema from file, BigQuery creates schema based on some ~100 initial values.

For example, in my case there is a column say X, the values in X is mostly of Integer type, but there are some values which are of String type, so bq load will fail with schema mismatch, in such a scenario we need to change the data type to STRING.

So what I could do is manually create a new table by generating schema on my own. Or I could set the max_bad_record value to some 50, but that doesn't seem like a good solution. An ideal solution would be like this:

Upvotes: 0

Views: 726

Answers (1)

Tamir Klein
Tamir Klein

Reputation: 3642

As you can not change column type in bq (see this link)

BigQuery natively supports the following schema modifications:

BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE

All other schema modifications are unsupported and require manual workarounds

So as a workaround I suggest:

  • Use --max_rows_per_request = 1 in your script
  • Use 1 line which is the best suitable for your case with the optimized field type.

This will create the table with the correct schema and 1 line and from there you can load the rest of the data.

Upvotes: 1

Related Questions