user179156
user179156

Reputation: 871

beam join two streams with different windowing strategy

I have two independent streams of events , for one stream i have hourly buckets and for another stream i have 4 hour buckets , is it possible to join these two streams . How can i identify which windows on both stream to join ? Can i have a sliding window on one stream and join this with a fixed window on another stream example use case is i partition one stream into fixed minutely/hourly buckets but want them to join with the 24 hour rolling/sliding bucket the windows have to be aligned with same start time . Is it possible doing such in spark ?

Upvotes: 0

Views: 1085

Answers (1)

Lefteris S
Lefteris S

Reputation: 1672

In dataflow, you can do what you are looking for using Side Inputs. Your first stream (an unbounded PCollection) will be the main input to a ParDo transform. Your second stream will be the side input. The latter will be of type PCollectionView, which is a way of representing a PCollection as a single entity and you can pass it to your ParDo transform by invoking .withSideInputs. Since your side input is infinite and thus cannot be compressed into a single value, the PCollectionView will represent a single entity per window.

Regarding different window sizes, Dataflow projects the main input element's window into the side input's window set and then chooses the most appropriate side input window. In your example use case dataflow will project the main (hourly) input window against the side (24 hour) input window set and select the side input value from the appropriate 24 hour side input window.

Upvotes: 1

Related Questions