Reputation: 64
It is really important for my application to always emit a "window finished" message, even if the window was empty, and I cannot figure out how to do this. My initial idea was to output an int for each record processed, use Sum.integersGlobally, and then emit a record based on the result. That would give me a singleton per window, so I could simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which then emits nothing if the window was empty.
Upvotes: 2
Views: 109
Reputation: 206
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this would be cost prohibitive in many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
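To make the idea concrete, here is a minimal plain-Java simulation of flattening a main input with one zero per fixed window and then summing per window. It deliberately does not use the Dataflow SDK: the class HeartbeatSumSketch, the Event type, and the helper names are all hypothetical, and it assumes fixed windows aligned to timestamp 0 with non-negative timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the heartbeat idea; this is NOT Dataflow SDK code.
class HeartbeatSumSketch {
    // A timestamped integer, standing in for one element of a PCollection<Integer>.
    static final class Event {
        final long timestampMillis;
        final int value;

        Event(long timestampMillis, int value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Start of the fixed window containing the timestamp
    // (assumes non-negative timestamps and windows aligned to 0).
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    // One zero-valued heartbeat per fixed window that starts before endMillis,
    // beginning with the window that contains startMillis.
    static List<Event> heartbeats(long startMillis, long endMillis, long windowSizeMillis) {
        List<Event> out = new ArrayList<>();
        for (long t = windowStart(startMillis, windowSizeMillis); t < endMillis; t += windowSizeMillis) {
            out.add(new Event(t, 0));
        }
        return out;
    }

    // "Flatten" the main and heartbeat events, then sum the values per window.
    static Map<Long, Integer> sumPerWindow(List<Event> main, List<Event> heartbeat, long windowSizeMillis) {
        List<Event> flattened = new ArrayList<>(main);
        flattened.addAll(heartbeat);
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : flattened) {
            sums.merge(windowStart(e.timestampMillis, windowSizeMillis), e.value, Integer::sum);
        }
        return sums;
    }
}
```

Because every window receives a zero, a window with no real records still shows up in the result with a sum of 0, while windows that do have records are unaffected by the extra zero.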
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
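A rough sketch of what the core of such a generator might look like for fixed windows follows. The publish step is abstracted behind a callback, which is a hypothetical stand-in for whatever Pub/Sub client call you use to send a zero-valued message carrying the given timestamp. The class name, the tick method, and the assumption that windows are aligned to timestamp 0 are all illustrative.

```java
import java.util.function.LongConsumer;

// Sketch of a standalone heartbeat generator for fixed windows.
// publishZeroAt is a hypothetical stand-in for a real Pub/Sub publish call
// that would send a zero-valued message with the given timestamp.
class HeartbeatGeneratorSketch {
    private final long windowSizeMillis;
    private final LongConsumer publishZeroAt;
    private long nextWindowStart;

    HeartbeatGeneratorSketch(long startMillis, long windowSizeMillis, LongConsumer publishZeroAt) {
        this.windowSizeMillis = windowSizeMillis;
        this.publishZeroAt = publishZeroAt;
        // Align to the fixed window containing startMillis
        // (assumes non-negative timestamps and windows aligned to 0).
        this.nextWindowStart = startMillis - (startMillis % windowSizeMillis);
    }

    // Publishes one heartbeat for every window that has started by nowMillis
    // and has not been covered yet; returns how many were published.
    int tick(long nowMillis) {
        int published = 0;
        while (nextWindowStart <= nowMillis) {
            publishZeroAt.accept(nextWindowStart);
            nextWindowStart += windowSizeMillis;
            published++;
        }
        return published;
    }
}
```

In a real job you would call tick from a timer with the current time. Tracking the next uncovered window, rather than publishing once per timer firing, means a delayed or restarted timer still produces exactly one zero per window, provided the position is persisted across restarts.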
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.
Upvotes: 4