Reputation: 51
I created a new pipeline in Dataflow with a fixed window of 10 minutes based on the event timestamp. Initially there are no messages, so the watermark will be near real time.
Now suppose that during the window 10:10 to 10:20, at 10:12, I start publishing frequent messages with event time 10:12 and keep doing so for 20 minutes, until 10:32. Does that mean the watermark will remain at 10:12 until 10:32, will not advance even after wall-clock time passes 10:20, and the window result will not be emitted?
I just want to understand how the watermark will progress in this scenario. Will it wait until all the messages with event time 10:12 are acknowledged and a new message with an event time later than 10:12 arrives, or until an idle time of 2 minutes has elapsed?
Also, is the data watermark that we see in Dataflow the event-time watermark or the system watermark?
Upvotes: 2
Views: 2628
Reputation: 48456
Although robertwb gives a wonderful answer to this question, I want to summarize the basic concept of watermarks in Dataflow, based on "Watermarks: Time and Progress in Apache Beam and Beyond", in the hope that it helps others understand watermarks more clearly.
Upvotes: 1
Reputation: 5104
There are two separate things to consider when trying to think about watermarks: (1) where the watermark comes from and (2) how it is propagated through the pipeline.
For (2), if you are using standard fixed windows, the watermark will be held back by the minimum of the upstream watermark and the timestamp of the window. For example, suppose the data coming into your GBK (GroupByKey) is
<input watermark now at 10:10> [output watermark is 10:10]
<input element with timestamp 10:12> [output watermark stays at 10:10]
<input watermark now at 10:13> [output watermark now at 10:13]
<input element with timestamp 10:17> [output watermark stays at 10:13]
<input element with timestamp 10:23> ...
<input element with timestamp 10:14> [output watermark stays at 10:13]
Here the output watermark of this operation will be 10:13, being held up by the input watermark. Once the system has determined that all upstream data up to a certain point has been received, it can advance the input watermark, but the output watermark then remains at 10:20, because there is still data (the window) to be published at that timestamp. However much wall-clock time passes, the watermark stays stuck there.
<input watermark now at 10:22> [output watermark stays at 10:20]
Now the window gets published, and subsequently the output watermark advances.
<output window at 10:20> [output watermark stays at 10:20]
<output watermark advances to 10:22> [output watermark now at 10:22]
...
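The hold described above can be mimicked in a few lines of plain Python. This is only a toy model, not Beam or Dataflow code: the class `ToyWindowedGroupBy` and its helpers are hypothetical names, times are minutes since midnight, and the hold is placed at the window end (10:20) as in the trace, ignoring details such as triggers and allowed lateness.

```python
WINDOW_MINUTES = 10  # fixed window size, as in the question


def hhmm(text):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def fmt(minutes):
    """Convert minutes since midnight back to 'H:MM'."""
    return f"{minutes // 60}:{minutes % 60:02d}"


def window_end(timestamp):
    """End of the fixed window that contains `timestamp`."""
    return (timestamp // WINDOW_MINUTES + 1) * WINDOW_MINUTES


class ToyWindowedGroupBy:
    """Hypothetical stand-in for a windowed GroupByKey (toy model only)."""

    def __init__(self):
        self.input_watermark = 0
        self.pending = {}  # window end -> buffered element timestamps

    def on_element(self, timestamp):
        self.pending.setdefault(window_end(timestamp), []).append(timestamp)

    def on_input_watermark(self, watermark):
        self.input_watermark = watermark

    def output_watermark(self):
        # Held back by the minimum of the input watermark and every
        # window that still has unpublished data.
        return min([self.input_watermark] + list(self.pending))

    def fire_ready_windows(self):
        # Publish every window whose end the input watermark has passed.
        ready = sorted(end for end in self.pending if end <= self.input_watermark)
        return [(end, self.pending.pop(end)) for end in ready]


gbk = ToyWindowedGroupBy()
gbk.on_input_watermark(hhmm("10:10"))
gbk.on_element(hhmm("10:12"))
gbk.on_input_watermark(hhmm("10:13"))
for t in ("10:17", "10:23", "10:14"):
    gbk.on_element(hhmm(t))
print("before input advances:", fmt(gbk.output_watermark()))
gbk.on_input_watermark(hhmm("10:22"))
print("held by the window:   ", fmt(gbk.output_watermark()))
fired = gbk.fire_ready_windows()
print("after window fires:   ", fmt(gbk.output_watermark()))
```

In this model the output watermark sits at 10:20 until the 10:10 to 10:20 window is published, and only then moves on to 10:22, regardless of how much wall-clock time elapses in between.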
As for (1), the Source is responsible for publishing both the timestamped data and the watermark (e.g. "I promise not to publish data with timestamps before time X") into the pipeline. Each source has its own implementation for how to "know" a bound on the timestamps of future elements. IIRC, PubSub reads ahead and computes a heuristic about what messages it expects to see in the future.
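To make the idea of a source "promising" a lower bound concrete, here is a deliberately simplified sketch. It is not PubSub's actual algorithm: `ToySource` is a hypothetical class that assumes the source can see every message it has received but not yet acknowledged, and it reports the oldest such event time (or the current time when nothing is outstanding) as its watermark. Times are minutes since midnight.

```python
class ToySource:
    """Hypothetical source; an illustration only, not the PubSub heuristic."""

    def __init__(self):
        self.outstanding = {}  # message id -> event timestamp

    def receive(self, message_id, event_time):
        self.outstanding[message_id] = event_time

    def ack(self, message_id):
        del self.outstanding[message_id]

    def watermark(self, now):
        # With nothing outstanding, the promise can track wall-clock time.
        if not self.outstanding:
            return now
        # Otherwise the oldest outstanding event time bounds the promise.
        return min(now, min(self.outstanding.values()))


source = ToySource()
print(source.watermark(now=615))   # idle source: watermark follows `now`
source.receive("m1", 612)          # message with event time 10:12
print(source.watermark(now=625))   # held at the outstanding event time
source.ack("m1")
print(source.watermark(now=632))   # nothing outstanding: back to `now`
```

A real source cannot know about messages that have not been published yet, which is why its watermark is a heuristic and why data can still arrive behind it as late data.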
Upvotes: 3