Xinwei Liu

Reputation: 353

Is google dataflow BQ/BT Write atomic per job?

Maybe I am just bad at searching, but I couldn't find the answer in the documentation, so I want to try my luck here.

My question is: say I have a Dataflow job that writes to BigQuery or Bigtable, and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data left in my table?

I know that writes to GCS do not seem to be atomic: partial output partitions are produced along the way while the job is running.

However, I have tried dumping data into BQ with Dataflow, and it seems that the output table is not exposed to users until the job claims success.

Upvotes: 1

Views: 666

Answers (3)

Dan Halperin

Reputation: 2247

In batch mode, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):

  1. Write all data to a temporary directory on GCS.
  2. Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.

If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.

Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails there will be no data written to BigQuery.

For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried we will still have exactly-once write semantics to BigQuery.
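
To make the mechanism concrete, here is a rough sketch of the equivalent manual two-step procedure. It is written against the current google-cloud-bigquery Java client rather than whatever Dataflow uses internally, and the bucket, file paths, dataset, table, and job id are all hypothetical. The point it illustrates: the load job names an explicit file list and a deterministic job id, so it commits everything or nothing, and resubmitting the same id cannot load the data a second time.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobId;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;
    import java.util.Arrays;
    import java.util.List;

    public class ManualLoadSketch {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Step 1 (done by the Dataflow workers): rows already staged as files on GCS.
        List<String> tempFiles = Arrays.asList(
            "gs://my-temp-bucket/staging/part-00000.json",   // hypothetical paths
            "gs://my-temp-bucket/staging/part-00001.json");

        // Step 2: a single load job with an explicit file list and a deterministic
        // job id. The load commits all files or none; resubmitting the same job id
        // after a monitoring failure is rejected, so the data cannot load twice.
        LoadJobConfiguration load = LoadJobConfiguration
            .newBuilder(TableId.of("my_dataset", "my_table"), tempFiles)
            .setFormatOptions(FormatOptions.json())
            .build();
        Job job = bigquery.create(JobInfo.newBuilder(load)
            .setJobId(JobId.of("dataflow_job_12423423"))     // deterministic id
            .build());
        job.waitFor();
      }
    }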

Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In a common case, you have only one such write in your job, and so if the job succeeds the data will be in BigQuery and if the job fails there will be no data written.
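
For concreteness, here is a minimal sketch of the common single-sink case in Dataflow-1.x-style Java; the bucket, project, table, and schema names are hypothetical. Because the pipeline has exactly one BigQueryIO.Write, its output is all-or-nothing: either the load job commits the rows or the table is untouched.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import java.util.Collections;

    public class SingleSinkPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        TableSchema schema = new TableSchema().setFields(Collections.singletonList(
            new TableFieldSchema().setName("line").setType("STRING")));

        p.apply(TextIO.Read.from("gs://my-bucket/input/*"))          // hypothetical input
         .apply(ParDo.of(new DoFn<String, TableRow>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(new TableRow().set("line", c.element()));
           }
         }))
         // The one and only sink: in batch this stages files on GCS and then
         // issues a single atomic BigQuery load job, as described above.
         .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table") // hypothetical table
             .withSchema(schema));

        p.run();
      }
    }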

However: note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have completed successfully before the Dataflow job fails. Completed writes are not reverted when the Dataflow job fails, so you may need to be careful when rerunning a Dataflow pipeline with multiple sinks to ensure correctness in the presence of committed writes from the earlier failed job.
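
One way to make such reruns safe, sketched below with hypothetical table names, is to give each sink a write disposition that makes a repeated load idempotent (e.g. WRITE_TRUNCATE), so a sink that already committed in the failed run is simply overwritten by the rerun. This is only one option; whether it fits depends on whether overwriting the whole destination table is acceptable.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class TwoSinksExample {
      // Two independent BigQuery sinks over already-computed row collections.
      // WRITE_TRUNCATE makes each load idempotent: if the summary sink committed
      // in a run that later failed at the detail sink, rerunning the whole job
      // simply overwrites the summary table rather than double-writing it.
      static void writeBothTables(PCollection<TableRow> summary,
                                  PCollection<TableRow> detail,
                                  TableSchema summarySchema,
                                  TableSchema detailSchema) {
        summary.apply("WriteSummary",
            BigQueryIO.Write.to("my-project:my_dataset.summary")   // hypothetical
                .withSchema(summarySchema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
        detail.apply("WriteDetail",
            BigQueryIO.Write.to("my-project:my_dataset.detail")    // hypothetical
                .withSchema(detailSchema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
      }
    }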

Upvotes: 4

William Vambenepe

Reputation: 150

BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs

Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.

Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of the Dataflow job that might have created the BigQuery job. For example, if your Dataflow job fails but has already written to BigQuery before failing (and that BigQuery job completed), then the data will remain in BigQuery.

Upvotes: 1

Solomon Duskis

Reputation: 2711

I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails partway through will leave partial data in Bigtable.
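
To illustrate what row-level atomicity means in practice, here is a sketch using the Bigtable HBase-compatible client; the table, column family, qualifier, and row key are hypothetical. Each single-row mutation commits or fails on its own, so a failed job can leave some rows written and others not, but never a half-written row. Using deterministic row keys and values makes a rerun effectively idempotent, since it just rewrites the same rows.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowLevelWriteExample {
      // Each Put is a single-row mutation: Bigtable applies it atomically for that
      // row only. There is no multi-row or job-level transaction, so a job that
      // fails partway leaves exactly the rows that were already Put, each complete.
      static void writeOneRow(Connection connection, String rowKey, String value)
          throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my-table"))) {  // hypothetical
          Put put = new Put(Bytes.toBytes(rowKey));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value));
          table.put(put);  // atomic for this row, independent of every other row
        }
      }
    }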

Upvotes: 1
