Reputation: 3178
I am trying to think through how to architect some data pipeline needs, and I simply want to know if the following is possible: can a single Dataflow streaming job read from one Pub/Sub subscription, write each message to BigQuery as it arrives, and also perform windowed aggregations over the same messages (writing the results to another table)?
I know that I can do this in other ways. For instance, I can create two Dataflow jobs/pipelines that listen to the same subscription. (Or would it be better to have two separate subscriptions attached to the same topic?) I could also create one subscription for the Dataflow job, then another subscription on the same topic that just pushes to BigQuery immediately (see the sketch at the end of this question).
But if I could accomplish both with one set of code -- and therefore one CI/CD job to monitor -- it would simplify what we have to maintain, and that would be much preferred.
Is this possible?
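For reference, the "push straight to BigQuery" alternative I mentioned would be set up outside Beam, with the Pub/Sub admin client. A minimal sketch; the project, topic, subscription, and table names are all placeholders:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "my-topic")
subscription_path = subscriber.subscription_path("my-project", "bq-push-sub")

# Rows land in the table as messages arrive; no pipeline code to maintain.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.my_dataset.raw_events",  # placeholder table
    write_metadata=True,  # also record message id, publish time, etc.
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )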
Upvotes: 0
Views: 185
Reputation: 656
Yes, that is possible. To write to BigQuery without using a window, set the write method to STREAMING_INSERTS or STORAGE_WRITE_API (optionally with use_at_least_once set to True for even lower latency, at the risk of duplicates).
For the rest of the pipeline, you just need to have two branches. Something like this:
# Assumes: import apache_beam as beam
# and: from apache_beam.transforms.window import FixedWindows
msgs = p | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(...)
dictionaries = msgs | "Transform to dict with some schema" >> ...
# Branch 1: stream each row straight to BigQuery, unwindowed
dictionaries | beam.io.WriteToBigQuery(..., method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API, use_at_least_once=True, ...)
# Branch 2: window, then aggregate and write to another BigQuery table
dictionaries | beam.WindowInto(FixedWindows(size=300)) | ...  # (aggregate, write to another table, etc.)
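To make the branching concrete, here is a fuller, self-contained sketch. The project, subscription, and table names, the JSON payload, and its event_type field are assumptions for illustration, and both destination tables are assumed to already exist:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)  # Pub/Sub sources need streaming mode

with beam.Pipeline(options=options) as p:
    dictionaries = (
        p
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Parse JSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
    )

    # Branch 1: stream every row to BigQuery as it arrives, no window.
    dictionaries | "Raw to BQ" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.raw_events",
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        use_at_least_once=True,  # lower latency, duplicates possible
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )

    # Branch 2: 5-minute fixed windows, count per event_type, second table.
    (
        dictionaries
        | "Window" >> beam.WindowInto(FixedWindows(size=300))
        | "Key by type" >> beam.Map(lambda row: (row["event_type"], 1))
        | "Count per key" >> beam.CombinePerKey(sum)
        | "To row" >> beam.Map(lambda kv: {"event_type": kv[0], "n": kv[1]})
        | "Agg to BQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.event_counts",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )

Both branches consume the same dictionaries PCollection, so one job (and one CI/CD pipeline) covers both destinations.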
Upvotes: 1