Applying multiple GroupByKey transforms in a DataFlow job causing windows to be applied multiple times

Question

We have a Dataflow job that is subscribed to a Pub/Sub stream of events. We apply sliding windows of 1 hour with a 10-minute period. In our code, we run Count.perElement to get the count for each element, and we then want to pass the result through Top.of to get the top N elements.

At a high level:

1) Read from PubsubIO
2) Window.into(SlidingWindows.of(windowSize).every(period)) // windowSize = 1 hour, period = 10 mins
3) Count.perElement
4) Top.of(n, comparisonFunction)
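To make the intended result concrete, here is a minimal plain-Java sketch of what steps 2 to 4 are meant to compute, written without the Dataflow SDK. The class SlidingTopN, its Event type and its method names are hypothetical and exist only for this illustration, and it assumes windows are half-open, [start, start + size), with starts aligned to multiples of the period. It says nothing about how Dataflow actually schedules or watermarks the work.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class SlidingTopN {
    /** Hypothetical event: an id plus its event timestamp. */
    static final class Event {
        final Instant timestamp;
        final String id;

        Event(Instant timestamp, String id) {
            this.timestamp = timestamp;
            this.id = id;
        }
    }

    /** Step 2: start times of every window [start, start + size) that contains ts. */
    static List<Instant> windowStarts(Instant ts, Duration size, Duration period) {
        long p = period.toMillis();
        long t = ts.toEpochMilli();
        long lastStart = t - Math.floorMod(t, p); // latest aligned start <= ts
        List<Instant> starts = new ArrayList<>();
        for (long s = lastStart; s > t - size.toMillis(); s -= p) {
            starts.add(Instant.ofEpochMilli(s));
        }
        return starts;
    }

    /** Step 3: occurrences of each id, kept separately for every window. */
    static Map<Instant, Map<String, Long>> countPerWindow(
            List<Event> events, Duration size, Duration period) {
        Map<Instant, Map<String, Long>> counts = new TreeMap<>();
        for (Event e : events) {
            for (Instant start : windowStarts(e.timestamp, size, period)) {
                counts.computeIfAbsent(start, k -> new TreeMap<>())
                      .merge(e.id, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Step 4: the n ids with the highest counts in one window, highest first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Duration size = Duration.ofMinutes(60);
        Duration period = Duration.ofMinutes(10);
        List<Event> events = Arrays.asList(
                new Event(Instant.parse("2016-01-01T00:05:00Z"), "a"),
                new Event(Instant.parse("2016-01-01T00:05:30Z"), "a"),
                new Event(Instant.parse("2016-01-01T00:07:00Z"), "b"));
        // Print the top id for every window that received at least one event.
        countPerWindow(events, size, period).forEach(
                (start, counts) -> System.out.println(start + " -> " + topN(counts, 1)));
    }
}
```

With a 1 hour window and a 10 minute period, each event lands in 6 overlapping windows, and the count and top-N are computed once per window. Nothing in this intended result requires the windowing to be applied a second time.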

What we're seeing is that the window appears to be applied twice, so data is watermarked 1 hour 40 minutes (instead of 50 minutes) behind the current time. When we dig into the job graph on the Dataflow console, we see two GroupByKey operations being performed on the data:

1) As part of Count.perElement. The watermark on the data from this step onwards is 50 minutes behind the current time, which is expected.
2) As part of Top.of (in the Combine.PerKey). The watermark here appears to be a further 50 minutes behind, so data in the steps below it is watermarked 1 hour 40 minutes behind the current time.
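The arithmetic behind those figures can be checked with a short sketch. It takes the observation at face value and assumes that each GroupByKey step adds a lag equal to the window size minus the period; that relationship is inferred from the numbers reported here, not taken from Dataflow documentation. The class and method names are hypothetical.

```java
import java.time.Duration;

final class WatermarkLag {
    /** Assumed lag added by one grouping step: window size minus slide period. */
    static Duration perStage(Duration windowSize, Duration period) {
        return windowSize.minus(period);
    }

    /** Lag after several grouping steps, if each one adds the same amount. */
    static Duration total(Duration windowSize, Duration period, int groupingSteps) {
        return perStage(windowSize, period).multipliedBy(groupingSteps);
    }

    public static void main(String[] args) {
        Duration size = Duration.ofMinutes(60);
        Duration period = Duration.ofMinutes(10);
        System.out.println(perStage(size, period)); // PT50M
        System.out.println(total(size, period, 2)); // PT1H40M
    }
}
```

Under that assumption, one step gives the expected 50 minutes and two steps give the 1 hour 40 minutes seen below Top.of.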

This ultimately manifests in some downstream graphs being 50 minutes late.

So it looks as though windowing kicks in afresh every time a GroupByKey is applied.

Is this expected behavior? Is there any way to make the windowing apply only to the Count.perElement and turn it off after that?

Our code is something along the lines of:

final int top = 50;
final Duration windowSize = standardMinutes(60);
final Duration windowPeriod = standardMinutes(10);
final SlidingWindows window = SlidingWindows.of(windowSize).every(windowPeriod);

options.setWorkerMachineType("n1-standard-16");
options.setWorkerDiskType("compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
options.setJobName(applicationName);
options.setStreaming(true);
options.setRunner(DataflowPipelineRunner.class);

final Pipeline pipeline = Pipeline.create(options);

// Get events
final String eventTopic =
    "projects/" + options.getProject() + "/topics/eventLog";
final PCollection<String> events = pipeline
    .apply(PubsubIO.Read.topic(eventTopic));

// Create toplist
final PCollection<List<KV<String, Long>>> topList = events
    .apply(Window.into(window))
    .apply(Count.perElement()) // eventIds are repeated, so count each one
    // keep the top n events by count
    .apply(Top.of(top, orderByValue()).withoutDefaults());
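
The orderByValue() helper is not shown above. Top.of expects a comparator that is also Serializable and keeps the largest elements under it, so one plausible shape is sketched below. This is an assumption about the helper, not the actual code: it uses java.util.Map.Entry as a stand-in for the SDK's KV so that it compiles on its own, and the class name OrderByValue is hypothetical.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Orders (id, count) pairs by count, ascending, so the "largest" pair has the highest count. */
final class OrderByValue implements Comparator<Map.Entry<String, Long>>, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
        return Long.compare(a.getValue(), b.getValue());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = Arrays.asList(
                new SimpleEntry<>("a", 3L),
                new SimpleEntry<>("b", 7L),
                new SimpleEntry<>("c", 5L));
        // The maximum under this comparator is the pair with the highest count.
        System.out.println(Collections.max(counts, new OrderByValue()).getKey());
    }
}
```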
