JW2

Reputation: 349

Airflow Pipeline CSV to BigQuery with Schema Changes

Background

I need to design an Airflow pipeline to load CSVs into BigQuery.

The CSVs frequently have a changing schema. After loading the first file, the schema might be

id | ps_1 | ps_1_value

but when the second file lands and I load it, it might look like

id | ps_1 | ps_1_value | ps_2 | ps_2_value

Question

What's the best approach to handling this?


My first thought on approaching this would be:

  1. Load the second file
  2. Compare the schema against the current table
  3. Update the table, adding two columns (ps_2, ps_2_value)
  4. Insert the new rows

I would do this in a PythonOperator.

If file 3 comes in looking like id | ps_2 | ps_2_value, I would fill in the missing columns and do the insert.
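Roughly, I'm imagining the compare-and-update step looking like the sketch below (assuming the google-cloud-bigquery client, that new columns can be added as NULLABLE STRING, and placeholder table names):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('my-project.my_dataset.my_table')  # placeholder table

# Header of the newly landed file (file 2 in the example above)
csv_columns = ['id', 'ps_1', 'ps_1_value', 'ps_2', 'ps_2_value']

# Any header not already in the table becomes a new NULLABLE column
existing = {field.name for field in table.schema}
new_fields = [
    bigquery.SchemaField(name, 'STRING', mode='NULLABLE')
    for name in csv_columns
    if name not in existing
]

# Widen the table before inserting the new rows
if new_fields:
    table.schema = list(table.schema) + new_fields
    client.update_table(table, ['schema'])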

Thanks for the feedback.

Upvotes: 3

Views: 5895

Answers (2)

JW2

Reputation: 349

After loading two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.

Edit: The light-bulb moment was realizing that schema_update_options exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html

# Airflow 1.x contrib import path for this operator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    # Let BigQuery add new columns and relax required fields on append
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    # Infer the schema from the CSV header instead of supplying one
    autodetect=True,
    dag=dag
)


Upvotes: 6

rmesteves

Reputation: 4085

Basically, the recommended pipeline for your case consists of creating a temporary table to treat your new data. Since Airflow is an orchestration tool, it's not recommended to push large flows of data through it.

Given that, your DAG could look very similar to your current one:

  1. Load the new file to a temporary table
  2. Compare the actual table's schema against the temporary table's schema.
  3. Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the schema_update_options parameter. Additionally, if your actual table's fields are in NULLABLE mode, it can easily handle the case where your new data is missing some fields (see the sketch after this list).
  4. Delete your temporary table
  5. If you're using GCS, move your file to another bucket or directory.
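A minimal sketch of steps 3-5 with the Python clients for BigQuery and Cloud Storage could look like this (all table, bucket, and object names below are placeholders):

from google.cloud import bigquery, storage

bq = bigquery.Client()

# Step 3: append the temporary table's rows to the actual table,
# letting BigQuery add any new columns on the fly
job_config = bigquery.QueryJobConfig(
    destination='my-project.my_dataset.actual_table',
    write_disposition='WRITE_APPEND',
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)
bq.query(
    'SELECT * FROM `my-project.my_dataset.temp_table`',
    job_config=job_config,
).result()

# Step 4: drop the temporary table
bq.delete_table('my-project.my_dataset.temp_table', not_found_ok=True)

# Step 5: move the processed file to another bucket
gcs = storage.Client()
src_bucket = gcs.bucket('my-airflow-bucket')
blob = src_bucket.blob('data/example_data_3.csv')
src_bucket.copy_blob(blob, gcs.bucket('my-processed-bucket'))
blob.delete()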

Finally, I would like to point out some links that might be useful to you:

  1. Airflow documentation (BigQuery operators)
  2. An article that shows a problem similar to yours, where you can find some of the information mentioned above.

I hope this helps.

Upvotes: 1
