Reputation: 2820
I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one. To do so I use the BigQuery Storage API, which creates a number of read sessions to pull data from. In order to route the BigQuery writes to the right partition I use the file loads method. Streaming inserts were not an option due to the 30-day limitation, and the Storage Write API seems limited when it comes to identifying the target partition.
With the file load method, the data are first written to GCS. The issue is that this takes too much time and there is a risk of the read sessions closing. Behind the scenes the file load method is complex, with multiple steps: for example, writing files to GCS and then combining the entries into a file joined per destination/partition.
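For reference, the write step routes each row to its daily partition with a table decorator, roughly like this sketch (table and column names are placeholders):

```python
import apache_beam as beam

DEST_TABLE = 'my-project:my_dataset.partitioned_table'  # placeholder

def partition_destination(row):
    # Route each row to its daily partition with a table decorator,
    # e.g. partitioned_table$20230115 ('event_date' is a placeholder column).
    return f"{DEST_TABLE}${row['event_date'].replace('-', '')}"

write = beam.io.WriteToBigQuery(
    table=partition_destination,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # stages files on GCS, then runs load jobs
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```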
Based on the Dataflow processes, it seems that workers can execute workloads from different parts of the pipeline at the same time.
How can I avoid the risk of the sessions closing? Is there a way for my Dataflow workers to focus first only on the critical part, which is writing to GCS, and only once that is done move on to all the other steps?
Upvotes: 1
Views: 206
Reputation: 656
You can do a Reshuffle right before applying the write to BigQuery. In Dataflow, that will create a checkpoint and a new stage in the job. The write to BigQuery will start only when all steps previous to the reshuffle have finished, and in case of errors and retries, the job will backtrack to that checkpoint.
Please note that a Reshuffle implies shuffling the data, so there will be a performance impact.
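For illustration, a minimal sketch of where the Reshuffle would sit, assuming a Python pipeline that reads with the Storage API and writes with file loads (table and column names are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SOURCE_TABLE = 'my-project:my_dataset.source_table'      # placeholder
DEST_TABLE = 'my-project:my_dataset.partitioned_table'   # placeholder

def partition_destination(row):
    # 'event_date' is a placeholder partition column.
    return f"{DEST_TABLE}${row['event_date'].replace('-', '')}"

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadWithStorageAPI' >> beam.io.ReadFromBigQuery(
           table=SOURCE_TABLE,
           method=beam.io.ReadFromBigQuery.Method.DIRECT_READ)
     # Checkpoint: all steps before this point finish and their output is
     # committed before the file-load write stage starts.
     | 'Checkpoint' >> beam.Reshuffle()
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
           table=partition_destination,
           method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```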
Upvotes: 2