Reputation: 1094
I have a pipeline with a BigQuery table as a sink. I need to perform some steps exactly after the data has been written to BigQuery. Those steps include performing queries on that table, reading data from it, and writing to a different table.
How can I achieve this? Should I create a separate pipeline for the latter steps? But then triggering it after the first pipeline finishes would be another problem, I assume.
If none of the above works, is it possible to call another Dataflow job (template) from a running pipeline?
Really need some help with this.
Thanks.
Upvotes: 2
Views: 2055
Reputation: 1067
A workaround I have been using with templates is to write the result of the IO operations to a metadata file in a specific bucket; a Cloud Function (which acts as my orchestrator) gets triggered by that file and, in turn, launches the following pipeline. However, I have only tested it with TextIO operations. So, in your case:

1. The first pipeline writes to BigQuery and, once the write has completed, writes a small metadata file to a dedicated bucket.
2. A Cloud Function is triggered by the creation of that file.
3. The Cloud Function launches the second pipeline (e.g. from a template), which queries the table, reads from it, and writes to the other table.
I'm pretty sure a similar approach can easily be replicated using Pub/Sub instead of writing to buckets (e.g. see here for the second step in my list).
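As a rough illustration, here is a minimal sketch of the Cloud Function part (step 2 and 3), assuming a Java background function that fires on the metadata file's "finalize" event and launches a classic Dataflow template via the Dataflow REST client. The project, bucket, template path, file suffix, and template parameter names are all hypothetical placeholders, not something from the question:

```java
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.LaunchTemplateParameters;
import com.google.api.services.dataflow.model.RuntimeEnvironment;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.Collections;

public class LaunchSecondPipeline implements BackgroundFunction<LaunchSecondPipeline.GcsEvent> {

  // Minimal POJO for the GCS event payload; only the fields we need here.
  public static class GcsEvent {
    public String bucket;
    public String name;
  }

  @Override
  public void accept(GcsEvent event, Context context) throws Exception {
    // React only to the metadata file written by the first pipeline
    // ("_SUCCESS" is an assumed naming convention).
    if (!event.name.endsWith("_SUCCESS")) {
      return;
    }

    Dataflow dataflow =
        new Dataflow.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(GoogleCredentials.getApplicationDefault()))
            .setApplicationName("pipeline-orchestrator")
            .build();

    LaunchTemplateParameters launch =
        new LaunchTemplateParameters()
            .setJobName("second-pipeline-" + System.currentTimeMillis())
            .setParameters(Collections.singletonMap("inputTable", "my_dataset.my_table"))
            .setEnvironment(new RuntimeEnvironment().setTempLocation("gs://my-bucket/temp"));

    // Launch the staged template for the second pipeline.
    dataflow
        .projects()
        .templates()
        .launch("my-project", launch)
        .setGcsPath("gs://my-bucket/templates/second-pipeline")
        .execute();
  }
}
```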
Upvotes: 0
Reputation: 17913
This is currently not explicitly supported by BigQueryIO. The only workaround is to use separate pipelines: start the first pipeline, wait for it to finish (e.g. using `pipeline.run().waitUntilFinish()`), then start the second pipeline (make sure to use a separate Pipeline object for it; reusing the same object multiple times is not supported).
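For example, a minimal sketch of that sequencing with Beam's Java SDK (the input path, table names, and query are placeholders I made up, and the destination tables are assumed to already exist):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SequentialPipelines {
  public static void main(String[] args) {
    // Note: reading via fromQuery() needs --tempLocation set in the options.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // First pipeline: write the data into the BigQuery table.
    Pipeline first = Pipeline.create(options);
    first
        .apply("ReadInput", TextIO.read().from("gs://my-bucket/input/*.txt"))
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(line -> new TableRow().set("line", line)))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    // Block until the write to BigQuery has fully completed.
    first.run().waitUntilFinish();

    // Second pipeline: a fresh Pipeline object (reusing `first` is not supported).
    Pipeline second = Pipeline.create(options);
    second
        .apply("QueryBQ",
            BigQueryIO.readTableRows()
                .fromQuery("SELECT * FROM `my-project.my_dataset.my_table`")
                .usingStandardSql())
        .apply("WriteOtherTable",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.other_table")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    second.run().waitUntilFinish();
  }
}
```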
Upvotes: 1