Reputation: 588
What is a watermark in Flink with respect to Event time processing? Why is it needed.? Why is it needed in all cases of event time being used. By all cases I mean if i dont do a window opeation then why do we still need a water mark. I come from spark background. In spark we need watermarks only when we use windows on the incoming events.
I have read few articles and it seems to me that watermarks and windows seems same.If there are differences please explain and point it put
Post your reply I did some more reading. Below is a query that is more specific.
Main Question:- Why do we need outoforder when we have acceptedlateness.
Given below example:
Assume you have a BoundedOutOfOrdernessTimestampExtractor with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
In the above example [12:02, C] record is not dropped but included into the window 12:00 -12:10 and later evaluated.- Hence the watermark could as well be the event timestamp
The record [12:09, G] is included into the window 12:00 - 12:10 only when there is a acceptedlateness of 5mins configured. This takes care of late and out of order events
So now adding to my previous question above, what is the necessary of outoforder option to be BoundedOutOfOrdernessTimestampExtractor of some value(other than 0) instead of the event timestamp istelf ?
What is that outoforder can achieve which allowedlateness cannot and in what scenario it does?
Upvotes: 3
Views: 4017
Reputation: 3634
Watermarks and windows are closely related but they are very different concepts.
Watermarks are needed for any kind of event-based aggregation to cut off late events. Windows can only close when they receive an appropriate watermark and that's when results of aggregations are published.
If you have no out of order events, you can set watermarks to be equivalent to the timestamps of input events. But that's usually a luxury.
edit to address questions in comment.
is it a rule of thumb to keep the watermarks duration equal to window duration because by only doing so the result is calculated and emitted.
No, the durations are independent, but add up the lag on a given event.
Your watermark duration depends on your data and how much lag you can take for your application. Let's say most events are in order, 10% are coming up to 1s late, an additional 5% up to 10s, and 1% up to 1h.
If you set watermark duration to 0, then 16% of your data points are discarded, but Flink will receive no additional lag. If your watermark trails 1s behind your events, you will lose 6% of your data, but the results will have 1s more lag. If you want to retain all data, Flink will need to wait for 1h on each aggregation until Flink can be sure that no data is missing.
But then what is the role of trigger? and how does sliding windows coordinate with water marks and triggers. Can you please explain as how they work with each other ?
Let's say, you have a window of 1 min and a watermark delay of 5 s. A window will only trigger when it is sure that all relevant data has been seen. In this case, it needs to wait 1 min 5 s to trigger, such that the last event of the window has surely arrived.
Btw events later as watermark are discarded by default. You can change that behavior.
Upvotes: 8