livathinos

Reputation: 25

Dataflow pipeline waits for elements from all streams before performing GroupBy

We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:

[Image: Pipeline Example]

I noticed that the job will not produce any output until both streams contain some traffic.

For example, suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. During this time, the job's DAG shows elements accumulating in the GroupByKey step, but nothing is propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high-traffic and low-traffic streams in the same job, since output is delayed for as long as it takes Stream 2 to pick up messages.

I am not sure whether this observation is correct, so I wanted to ask whether this is how Flatten/GroupByKey works in general and, if so, whether the issue we're seeing can be avoided by constructing the pipeline differently.

(Example JobID: 2017-02-10_06_48_01-14191266875301315728)

Upvotes: 2

Views: 1383

Answers (1)

Ben Chambers

Reputation: 6130

As described in the GroupByKey documentation, the default behavior is to wait until all data within the window has arrived. This is necessary to ensure the correctness of downstream results.
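To make the waiting behavior concrete, here is a toy model in plain Python. It is not Beam or Dataflow code, and the names `Source`, `flattened_watermark`, and `window_can_fire` are made up for illustration. It assumes the usual Beam model, in which the watermark after a Flatten is the minimum of its inputs' watermarks, and it assumes the quiet stream's watermark does not advance while that stream is idle.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    watermark: float  # event time (seconds) up to which this input is believed complete


def flattened_watermark(sources):
    """Watermark after flattening: held back by the slowest input."""
    return min(s.watermark for s in sources)


def window_can_fire(window_end, sources):
    """Default behavior in this model: fire once the watermark reaches the window end."""
    return flattened_watermark(sources) >= window_end


stream1 = Source("stream1", watermark=600.0)  # busy stream, well past the first window
stream2 = Source("stream2", watermark=30.0)   # idle stream, watermark stalled (assumption)

print(window_can_fire(60.0, [stream1, stream2]))  # False: stream2 holds the window open

stream2.watermark = 610.0                         # stream2 finally receives traffic
print(window_can_fire(60.0, [stream1, stream2]))  # True
```

In this model the busy stream's elements sit in the grouping step until the idle stream catches up, which would match the accumulation visible in the job's DAG.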

Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
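As a rough illustration of what an early-firing trigger changes, the sketch below models a single key in a single window in plain Python. It is a toy model rather than Beam API: `emit_panes`, `close_at`, and `early_delay` are hypothetical names, and the model assumes discarding panes (each element is emitted once). In Beam terms it loosely corresponds to a watermark trigger with early processing-time firings.

```python
def emit_panes(arrivals, close_at, early_delay=None):
    """Return [(fire_time, values)] for one key in one window.

    arrivals:    (processing_time, value) pairs, sorted, all before close_at
    close_at:    processing time at which the watermark passes the window end
    early_delay: if set, also fire this long after the first element of each
                 pane, discarding what was already emitted
    """
    fired, pane, deadline = [], [], None
    for t, value in arrivals:
        if deadline is not None and t >= deadline:
            fired.append((deadline, pane))  # early firing
            pane, deadline = [], None
        pane.append(value)
        if early_delay is not None and deadline is None:
            deadline = t + early_delay      # timer starts at first element in pane
    if pane and deadline is not None and deadline < close_at:
        fired.append((deadline, pane))      # last early firing before the window closes
        pane = []
    if pane:
        fired.append((close_at, pane))      # on-time firing when the watermark passes
    return fired


arrivals = [(0, "a"), (5, "b"), (40, "c")]

# Watermark held back for an hour by an idle input (assumed scenario).
print(emit_panes(arrivals, close_at=3600))
# [(3600, ['a', 'b', 'c'])]

print(emit_panes(arrivals, close_at=3600, early_delay=30))
# [(30, ['a', 'b']), (70, ['c'])]
```

The trade-off is that downstream steps then see several partial results per key and window instead of one complete result, so they need to tolerate that.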

You may also be able to use the slow stream as a side input to the processing of the fast stream.
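The idea behind the side-input approach can be sketched in plain Python as a lookup join. This is a toy model, not Beam code: `enrich` and the event tuple layout are assumptions made for illustration. Each fast element is emitted as soon as it arrives, paired with the latest value seen so far on the slow stream, so there is no grouping step for the fast stream to wait in.

```python
def enrich(events):
    """Join fast elements against the latest slow-stream value per key.

    events: iterable of (time, source, key, value) sorted by time, where
            source is "slow" or "fast" (hypothetical layout).
    Returns a list of (key, fast_value, slow_value_or_None).
    """
    latest_slow = {}  # plays the role of the side-input view
    out = []
    for _, source, key, value in events:
        if source == "slow":
            latest_slow[key] = value  # refresh the lookup
        else:
            # Emit immediately with whatever slow-stream data is known.
            out.append((key, value, latest_slow.get(key)))
    return out


events = [
    (0, "slow", "k1", "meta-v1"),
    (1, "fast", "k1", "e1"),
    (2, "fast", "k2", "e2"),       # no slow-stream data for k2 yet
    (3, "slow", "k1", "meta-v2"),
    (4, "fast", "k1", "e3"),
]
print(enrich(events))
# [('k1', 'e1', 'meta-v1'), ('k2', 'e2', None), ('k1', 'e3', 'meta-v2')]
```

Whether this fits depends on whether the slow stream is small enough to hold as a lookup and whether fast elements may be processed against slightly stale or missing slow-stream data.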

If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.

Upvotes: 3
