Reputation: 3178
I am trying to think through how to architect some data pipeline needs, and I simply want to know if the following is possible: can a single Dataflow streaming job read from one Pub/Sub subscription, write each message to BigQuery as it arrives, and also perform windowed aggregations over the same messages (writing the results to another table)?
I know that I can do this in other ways. For instance, I can create two Dataflow jobs/pipelines that listen to the same subscription. (Or would it be better to have two separate subscriptions attached to the same topic?) I could also create one subscription for the Dataflow job, then another subscription on the same topic that just pushes to BigQuery immediately (see the sketch at the end of this question).
But if I could accomplish both with one set of code -- and therefore one CI/CD job to monitor -- it would simplify what we have to maintain, and that would be much preferred.
Is this possible?
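For reference, the "push straight to BigQuery" alternative I mentioned would be set up outside Beam, with the Pub/Sub admin client. A minimal sketch; the project, topic, subscription, and table names are all placeholders:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "my-topic")
subscription_path = subscriber.subscription_path("my-project", "bq-push-sub")

# Rows land in the table as messages arrive; no pipeline code to maintain.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.my_dataset.raw_events",  # placeholder table
    write_metadata=True,  # also record message id, publish time, etc.
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )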
Upvotes: 0
Views: 185
Reputation: 656
Yes, that is possible. To write to BigQuery without using a window, set the write method to STREAMING_INSERTS or STORAGE_WRITE_API (optionally with use_at_least_once set to True for even lower latency, at the risk of duplicates).
For the rest of the pipeline, you just need to have two branches. Something like this:
# Assumes: import apache_beam as beam
# and: from apache_beam.transforms.window import FixedWindows
msgs = p | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(...)
dictionaries = msgs | "Transform to dict with some schema" >> ...
# Branch 1: stream each row straight to BigQuery, unwindowed
dictionaries | beam.io.WriteToBigQuery(..., method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API, use_at_least_once=True, ...)
# Branch 2: window, then aggregate and write to another BigQuery table
dictionaries | beam.WindowInto(FixedWindows(size=300)) | ...  # (aggregate, write to another table, etc.)
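To make the branching concrete, here is a fuller, self-contained sketch. The project, subscription, and table names, the JSON payload, and its event_type field are assumptions for illustration, and both destination tables are assumed to already exist:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)  # Pub/Sub sources need streaming mode

with beam.Pipeline(options=options) as p:
    dictionaries = (
        p
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Parse JSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
    )

    # Branch 1: stream every row to BigQuery as it arrives, no window.
    dictionaries | "Raw to BQ" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.raw_events",
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        use_at_least_once=True,  # lower latency, duplicates possible
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )

    # Branch 2: 5-minute fixed windows, count per event_type, second table.
    (
        dictionaries
        | "Window" >> beam.WindowInto(FixedWindows(size=300))
        | "Key by type" >> beam.Map(lambda row: (row["event_type"], 1))
        | "Count per key" >> beam.CombinePerKey(sum)
        | "To row" >> beam.Map(lambda kv: {"event_type": kv[0], "n": kv[1]})
        | "Agg to BQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.event_counts",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )

Both branches consume the same dictionaries PCollection, so one job (and one CI/CD pipeline) covers both destinations.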
Upvotes: 1