Reputation: 353
Maybe I'm a bad searcher, but I couldn't find the answer in the documentation, so I want to try my luck here.
My question: say I have a Dataflow job that writes to BigQuery or Bigtable and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?
I know that writes to GCS don't seem to be atomic, in that partial output partitions are produced along the way while the job is running.
However, I have tried dumping data into BQ with Dataflow, and it seems the output table is not exposed to users until the job claims success.
Upvotes: 1
Views: 666
Reputation: 2247
In Batch, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):
1. Write all the rows to be imported to temporary files in GCS.
2. Kick off a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.
If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.
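To make step 2 concrete, here is a minimal sketch of such an atomic load job using the google-cloud-bigquery Java client. Dataflow does this internally, so this is only an illustration, and the bucket, dataset, and file names are hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.util.Arrays;
import java.util.List;

public class AtomicLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Step 2: one load job referencing an explicit list of temp files
    // (all names here are hypothetical).
    TableId table = TableId.of("my_dataset", "some_table");
    List<String> tempFiles = Arrays.asList(
        "gs://my-temp-bucket/tmp/rows-00000.json",
        "gs://my-temp-bucket/tmp/rows-00001.json");

    LoadJobConfiguration config = LoadJobConfiguration.newBuilder(table, tempFiles)
        .setFormatOptions(FormatOptions.json())
        .build();

    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job.getStatus().getError() != null) {
      // The load failed as a unit: none of the rows were committed.
      System.out.println("Load failed, nothing written: " + job.getStatus().getError());
    }
  }
}
```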
Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails, no data will be written to BigQuery.
For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423), so that if the Dataflow code monitoring the load job fails and is retried, we still have exactly-once write semantics to BigQuery.
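As an illustration of that deterministic-id trick, here is a sketch using the google-cloud-bigquery Java client; this is not Dataflow's actual internal code, and the helper name is hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;

public class DeterministicJobId {
  // Create the load job under a fixed id; if a previous attempt already
  // created it, resume monitoring that job instead of loading the data twice.
  static Job createOrResume(BigQuery bigquery, LoadJobConfiguration config) {
    JobId jobId = JobId.of("dataflow_job_12423423"); // deterministic id
    try {
      return bigquery.create(JobInfo.newBuilder(config).setJobId(jobId).build());
    } catch (BigQueryException alreadyExists) {
      // A duplicate-job error means the first attempt got through, giving
      // exactly-once semantics. (A real implementation would check that
      // this exception is actually a 409 duplicate, not some other error.)
      return bigquery.getJob(jobId);
    }
  }
}
```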
Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In the common case, you have only one such write in your job, so if the job succeeds the data will be in BigQuery, and if the job fails no data will have been written.
However: note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails.
This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks, to ensure correctness in the presence of committed writes from the earlier failed job, as in the sketch below.
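For example, one way to make such reruns safe is a truncating write disposition, so a rerun replaces any table the failed run already committed instead of appending a duplicate copy. A sketch, assuming the Dataflow Java SDK 1.x and hypothetical table names and schemas:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.WriteDisposition;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MultiSinkRerunSafety {
  // With WRITE_TRUNCATE, rerunning the whole pipeline after a partial
  // failure replaces any table the failed run already committed, rather
  // than appending a second copy of its rows.
  static void attachSinks(PCollection<TableRow> rowsA, PCollection<TableRow> rowsB,
                          TableSchema schemaA, TableSchema schemaB) {
    rowsA.apply(BigQueryIO.Write
        .to("my-project:my_dataset.table_a") // hypothetical tables
        .withSchema(schemaA)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
    rowsB.apply(BigQueryIO.Write
        .to("my-project:my_dataset.table_b")
        .withSchema(schemaB)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
  }
}
```

Note that WRITE_TRUNCATE only fits sinks whose full contents are recomputed on each run; for append-style sinks you would need a different deduplication strategy.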
Upvotes: 4
Reputation: 150
BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs:
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
Though, just to be clear: BigQuery is atomic at the level of the BigQuery job, not at the level of the Dataflow job that might have created it. For example, if your Dataflow job fails but it wrote to BigQuery before failing (and that BigQuery job completed), then the data will remain in BigQuery.
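So before rerunning, you could check whether the earlier BigQuery job actually completed. A sketch with the google-cloud-bigquery Java client, where the job id shown is hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobStatus;

public class CheckBeforeRerun {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Job job = bigquery.getJob(JobId.of("dataflow_job_12423423")); // hypothetical id
    boolean committed = job != null
        && job.getStatus().getState() == JobStatus.State.DONE
        && job.getStatus().getError() == null;
    if (committed) {
      // The BigQuery job finished, so its data is committed even though
      // the surrounding Dataflow job failed afterwards.
      System.out.println("Data already in BigQuery; skip re-loading.");
    }
  }
}
```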
Upvotes: 1
Reputation: 2711
I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails part way will write partial data into Bigtable.
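To illustrate what row-level atomicity buys you: all cells in a single-row mutation commit atomically, and deterministic row keys make a rerun overwrite rows rather than duplicate them. A sketch using the current Java Bigtable client, with hypothetical project, instance, table, and key names:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class RowLevelAtomicity {
  public static void main(String[] args) throws Exception {
    // Hypothetical project and instance ids.
    try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
      // Every cell in this single-row mutation commits atomically, but
      // atomicity never spans rows. A deterministic row key means a rerun
      // of the same job overwrites this row instead of leaving a stray copy.
      RowMutation mutation = RowMutation.create("my-table", "event#12345")
          .setCell("cf", "status", "processed")
          .setCell("cf", "source", "dataflow-rerun");
      client.mutateRow(mutation);
    }
  }
}
```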
Upvotes: 1