Reputation: 2820
I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one. To do so I use the BigQuery Storage API, which creates a number of read sessions to pull data from. In order to route the BigQuery writes to the right partition I use the file loads method. Streaming inserts were not an option due to the 30-day limitation, and the Storage Write API seems limited when it comes to identifying the target partition.
With the file load method, the data are first written to GCS. The issue is that this takes too much time and there is a risk of the read sessions closing. Behind the scenes the file load method is complex, with multiple steps: for example, writing files to GCS and then combining the entries into a file joined per destination/partition.
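For reference, the write step routes each row to its daily partition with a table decorator, roughly like this sketch (table and column names are placeholders):

```python
import apache_beam as beam

DEST_TABLE = 'my-project:my_dataset.partitioned_table'  # placeholder

def partition_destination(row):
    # Route each row to its daily partition with a table decorator,
    # e.g. partitioned_table$20230115 ('event_date' is a placeholder column).
    return f"{DEST_TABLE}${row['event_date'].replace('-', '')}"

write = beam.io.WriteToBigQuery(
    table=partition_destination,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # stages files on GCS, then runs load jobs
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```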
Based on the Dataflow processes, it seems that workers can execute workloads from different parts of the pipeline at the same time.
How can I avoid the risk of the sessions closing? Is there a way for my Dataflow workers to focus first only on the critical part, which is writing to GCS, and only once that is done move on to all the other steps?
Upvotes: 1
Views: 206
Reputation: 656
You can do a Reshuffle right before applying the write to BigQuery. In Dataflow, that will create a checkpoint and a new stage in the job. The write to BigQuery will start only when all steps previous to the reshuffle have finished, and in case of errors and retries, the job will backtrack to that checkpoint.
Please note that a Reshuffle implies shuffling the data, so there will be a performance impact.
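For illustration, a minimal sketch of where the Reshuffle would sit, assuming a Python pipeline that reads with the Storage API and writes with file loads (table and column names are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SOURCE_TABLE = 'my-project:my_dataset.source_table'      # placeholder
DEST_TABLE = 'my-project:my_dataset.partitioned_table'   # placeholder

def partition_destination(row):
    # 'event_date' is a placeholder partition column.
    return f"{DEST_TABLE}${row['event_date'].replace('-', '')}"

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadWithStorageAPI' >> beam.io.ReadFromBigQuery(
           table=SOURCE_TABLE,
           method=beam.io.ReadFromBigQuery.Method.DIRECT_READ)
     # Checkpoint: all steps before this point finish and their output is
     # committed before the file-load write stage starts.
     | 'Checkpoint' >> beam.Reshuffle()
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
           table=partition_destination,
           method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```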
Upvotes: 2