Reputation: 871
How accurate are watermark estimates in stream processing engines such as Apache Beam or Spark Streaming? My data source is files on GCS/S3, but I use the event time associated with each event as the timestamp for the windowing function. Any ideas on how this heuristic or estimate is calculated by these stream processing engines, and whether there is a way to measure how far off the estimate was?
My use case: I have several servers producing event logs on GCS/S3, and I am reading these files in a streaming way from my stream processing engine. There can be delays due to filesystem outages and failures, or servers not being able to flush log events for a couple of hours. In my stream processing pipeline, correctness is one of the important aspects when aggregating events, so I am curious how this watermark estimate is computed.
Upvotes: 1
Views: 1194
Reputation: 814
Generally speaking, the watermark is determined by the source. When a source announces a watermark of T, it is saying "I don't expect any more records with an event time earlier than T". The streaming engine can then proceed to close the related windows, etc. There could still be some events that arrive with a timestamp less than T, and those will be considered "late". In Apache Beam, you have control over such late events as well. Sources in Apache Beam provide the watermark by implementing the getWatermark() method (the documentation there is quite helpful too).
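For illustration, here is a minimal sketch (Beam Java SDK) of the watermark logic a custom unbounded reader might use. The class name, field names, and the two-hour delay are assumptions for this example, not part of Beam's API:

```java
import org.joda.time.Duration;
import org.joda.time.Instant;

// Sketch only: a real source would extend UnboundedSource.UnboundedReader<T>
// and implement advance(), getCurrent(), etc. Only the watermark logic is shown.
class LogFileReader {
  // Largest event timestamp observed so far across all records read.
  private Instant maxObservedEventTime = new Instant(0);

  // Assumed maximum delay; tune this to the lateness you actually observe.
  private static final Duration MAX_EXPECTED_DELAY = Duration.standardHours(2);

  void recordEventTime(Instant eventTime) {
    if (eventTime.isAfter(maxObservedEventTime)) {
      maxObservedEventTime = eventTime;
    }
  }

  // Beam calls getWatermark() on the reader; any record with an event time
  // older than the returned instant that arrives afterwards is "late".
  public Instant getWatermark() {
    return maxObservedEventTime.minus(MAX_EXPECTED_DELAY);
  }
}
```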
In your case, the critical part is knowing how delayed these files could be. You mentioned a couple of hours, so a simple heuristic could be to keep the watermark at 'latest event time - 2 hours'. Based on the expected distribution of delays, you could tighten that to, say, 10 minutes to get most of the benefit and treat further-delayed events as 'late'.
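To both tolerate and measure those late events, a hedged sketch of what that can look like with Beam's Java SDK is below; `events`, `LogEvent`, and the specific durations are placeholders for this example:

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assumed: 'events' is a PCollection<LogEvent> with event-time timestamps
// already attached (LogEvent is a placeholder type).
PCollection<LogEvent> windowed = events.apply(
    Window.<LogEvent>into(FixedWindows.of(Duration.standardMinutes(10)))
        // Keep window state around for 2 hours past the watermark so that
        // delayed files can still be folded into their windows.
        .withAllowedLateness(Duration.standardHours(2))
        .triggering(AfterWatermark.pastEndOfWindow()
            // Re-fire whenever late data shows up.
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .accumulatingFiredPanes());

// One way to gauge how far off the watermark was: count elements that arrive
// in LATE panes; the ratio of late to on-time elements is a rough error metric.
class CountLateData extends DoFn<LogEvent, LogEvent> {
  private final Counter lateEvents =
      Metrics.counter(CountLateData.class, "late-events");

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.pane().getTiming() == PaneInfo.Timing.LATE) {
      lateEvents.inc();
    }
    c.output(c.element());
  }
}
```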
Upvotes: 2