Mike Williamson

Reputation: 3178

Is it possible to create a Beam pipeline with multiple windowing needs

I am trying to think of how to architect some data pipeline needs, and I simply want to know if the following is possible: a single streaming Beam pipeline that reads from one Pub/Sub subscription, writes each message to BigQuery immediately (without windowing), and also windows the same data for aggregation into another table.


I know that I can do this in other ways. For instance, I can create two Dataflow jobs/pipelines that listen to the same subscription. (Or would it be better to have two separate subscriptions on the same topic?) I can also create a subscription for the Dataflow job, plus another subscription (on the same topic) that just pushes to BigQuery immediately.

But if I could have one set of code -- and therefore one CI/CD job to monitor -- to accomplish both, it would simplify what we need to maintain, and it would be much preferred.

Is this possible?

Upvotes: 0

Views: 185

Answers (1)

Israel Herraiz

Reputation: 656

Yes, that is possible. To write to BigQuery without a window, you will need to set the write method to STREAMING_INSERTS or STORAGE_WRITE_API (optionally with use_at_least_once set to True for even lower latency, at the risk of duplicates).

For the rest of the pipeline, you just need to have two branches. Something like this:

msgs = p | "Read from P/S" >> beam.io.ReadFromPubSub(...)
dictionaries = msgs | "Transform to dict with some schema" >> ...
dictionaries | beam.io.gcp.bigquery.WriteToBigQuery(..., method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API, use_at_least_once=True, ...)
dictionaries | beam.WindowInto(FixedWindows(size=300)) | ...  # (aggregate, write to another BigQuery table, etc.)

Upvotes: 1
