rish0097

Reputation: 1094

Get/Set BigQuery Job ID while doing BigQueryIO.write()

Is it possible to set the BigQuery job ID, or to get it, while a batch pipeline is running?
I know it's possible using the BigQuery API, but is it possible when using BigQueryIO from Apache Beam? I need to send an acknowledgement after writing to BigQuery that the load is complete.

Upvotes: 1

Views: 545

Answers (1)

jkff

Reputation: 17913

Currently this is not possible. It is complicated by the fact that a single BigQueryIO.write() may use many BigQuery jobs under the hood (i.e. BigQueryIO.write() is a general-purpose API for writing data to BigQuery, rather than an API for working with a single specific BigQuery load job), e.g.:

  • In case the amount of data to be loaded is larger than the BigQuery limits for a single load job, BigQueryIO.write() will shard it into multiple load jobs.
  • In case you are using one of the destination-dependent write methods (e.g. DynamicDestinations), and are loading into multiple tables at the same time, there will be at least one load job per table.
  • In case you are writing an unbounded PCollection using the BATCH_LOADS method, it will periodically issue load jobs for newly arrived data, subject to the notes above.
  • In case you're using the STREAMING_INSERTS method (which is allowed even when writing a bounded PCollection), there will be no load jobs at all.
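To illustrate the last two points, the write method can be chosen explicitly on the transform. A minimal sketch, assuming a hypothetical table name and an already-constructed schema (this requires the Beam Java SDK with the GCP IO module on the classpath):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;

// rows is a PCollection<TableRow>; table and schema are placeholders.
rows.apply("WriteToBQ",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")       // hypothetical destination
        .withSchema(schema)
        // Method.FILE_LOADS uses BigQuery load jobs (possibly many, as noted above);
        // Method.STREAMING_INSERTS bypasses load jobs entirely.
        .withMethod(Method.FILE_LOADS));
```

With the default method, Beam picks load jobs for bounded input and streaming inserts for unbounded input, so the number of underlying jobs (if any) is an implementation detail you cannot rely on.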

You will need to use one of the typical workarounds for "doing something after something else is done": for example, wait until the entire pipeline finishes by calling pipeline.run().waitUntilFinish() in your main program, and then perform your second action.
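That workaround can be sketched as follows. The sendAcknowledgement() call is a hypothetical placeholder for whatever notification mechanism you use:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

// Run the pipeline and block until it terminates.
PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();

// Only acknowledge if every BigQueryIO write (and the rest of the
// pipeline) completed successfully.
if (state == PipelineResult.State.DONE) {
  sendAcknowledgement();  // hypothetical: your own notification logic
}
```

Note this signals completion of the whole pipeline, not of an individual load job, which is the granularity BigQueryIO actually gives you.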

Upvotes: 5
