selectle

Reputation: 67

Writing and reading to and from BigQuery in one Dataflow job

I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data, and write it to several BigQuery tables. When this process is done, I would like to run a query over the just-generated tables. Ideally I would like this to happen in the same job.

Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs?

If this is possible in the same job, how would one solve it, given that the BigQuerySink does not generate a PCollection? If it is not possible in the same job, is there some way to trigger one job on the completion of another (i.e. run the querying job after the writing job)?
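For reference, the two-job fallback I can imagine is just running two pipelines back to back and blocking on the first, something like the sketch below (assuming a Beam-style Java SDK, where run() returns a result that can be waited on), but I would prefer not to orchestrate this by hand:

Pipeline writePipeline = Pipeline.create(options);
// ... transforms that fetch, transform, and write to BigQuery ...
writePipeline.run().waitUntilFinish();  // block until the write job is done

Pipeline queryPipeline = Pipeline.create(options);
// ... transforms that query the just-written tables ...
queryPipeline.run().waitUntilFinish();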

Upvotes: 2

Views: 1373

Answers (1)

Ben Chambers

Reputation: 6130

You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it were empty, you could then use it as an input to the step that reads from BigQuery, in a way that makes that step wait until the sink has finished.

You would need to create your own version of the BigQuerySink to do this.
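For illustration, a sketch of what that could look like -- here WriteWithSignal is a hypothetical sink variant that emits an empty PCollection<Void> once the write completes, and Beam-style side-input APIs are assumed:

PCollection<TableRow> rows = ...;

// Hypothetical sink variant: unlike the stock sink, it returns an empty
// PCollection<Void> that only materializes after the write has completed.
PCollection<Void> writeDone = rows.apply(new WriteWithSignal(tableSpec));

// Expose the signal as a side input; any step that consumes it must wait.
final PCollectionView<Iterable<Void>> signal =
    writeDone.apply(View.<Void>asIterable());

// The query step declares the signal as a side input, so the runner will
// not schedule it until the write has finished.
p.apply(Create.of("SELECT ... FROM just_written_table"))
 .apply(ParDo.of(new DoFn<String, TableRow>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
       c.sideInput(signal);  // creates the dependency on the write
       // ... run the query with the BigQuery client and emit TableRows ...
     }
   }).withSideInputs(signal));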

If possible, an easier option might be to have the second step read from the PCollection that you wrote to BigQuery, rather than reading the table back out of BigQuery. For example:

PCollection<TableRow> rows = ...;
// Write the rows to BigQuery (BigQueryIO.Write is the sink transform).
rows.apply(BigQueryIO.Write.to(...));
// Applying another transform to the same PCollection creates a second branch.
rows.apply(/* rest of the pipeline */);

You could even branch earlier in the pipeline if you wanted to continue processing the original elements rather than the TableRow objects written to BigQuery.
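For example, branching before the conversion to TableRow might look like this (a sketch; MyEvent and FormatAsTableRowFn are hypothetical stand-ins for your own element type and formatting DoFn):

PCollection<MyEvent> events = ...;
// Branch 1: format the elements as TableRows and write them to BigQuery.
events
    .apply(ParDo.of(new FormatAsTableRowFn()))
    .apply(BigQueryIO.Write.to(...));
// Branch 2: keep processing the original MyEvent elements.
events.apply(/* rest of the pipeline */);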

Upvotes: 5
