Reputation: 3299
AWS Kinesis offers a stream windowing implementation, stagger windows, designed to help with "analyzing groups of data that arrive at inconsistent times".
Such a window implementation is particularly powerful because the window starts only when the first event (as defined by the event grouping) is received and finishes a fixed time later, reducing the chance that events received very shortly after one another end up in separate windows.
Kinesis appears to be a great choice for a quick and easy stream implementation, but with a view to potential future 'lock-in' we're trying to understand how we might recreate similar functionality, if required, using Kafka Streams.
Kafka Streams appears to support the following windowing functions: tumbling, hopping, sliding, and session windows.
Based on our existing research, session windows might be the closest option to stagger windows. What we've noticed, however, is that session windows can still be 'updated' if a late event arrives even after that session would otherwise have been considered 'expired/emitted', and also that sessions may not be emitted until future 'stream time' events are recorded.
I'd therefore like to ask what the closest Kafka Streams equivalent of the stagger window might be, and what potential 'gotchas' are important to be aware of.
Upvotes: 1
Views: 289
Reputation: 62285
Session windows might be somewhat similar; however, session windows don't have a fixed size. Window boundaries are determined by a "gap" parameter. Taking the example from the Amazon docs, the first two events (let's call them A and B) are 10 seconds apart, the second and third (C) 35 seconds, and the third and fourth (D) 10 seconds. If you specify a gap of 10 seconds, you would get two windows, A,B and C,D, which is different from tumbling windows and different from stagger windows. If you specify a gap of 35 seconds, you get one window with all 4 events.
Depending on your use case, it might still work using session windows.
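To make the gap semantics concrete, here is a minimal sketch of a session-windowed count in Kafka Streams (the topic names "events" and "session-counts" and the String key/value types are assumptions for illustration):

```java
import java.time.Duration;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class SessionWindowExample {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");

        // With a 10s inactivity gap, A and B (10s apart) land in one session
        // and C and D (10s apart) in another; a 35s gap merges all four.
        KTable<Windowed<String>, Long> sessionCounts = events
            .groupByKey()
            .windowedBy(SessionWindows.with(Duration.ofSeconds(10)))
            .count();

        sessionCounts
            .toStream((windowedKey, count) -> windowedKey.key())
            .to("session-counts");

        return builder.build();
    }
}
```

Note that, unlike a stagger window, the session only closes after the gap of inactivity, not a fixed time after the first event.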
What we've noticed however that session windows can still be 'updated' if a late event arrives even after that session would otherwise have been considered 'expired/emitted',
Yes, this is required to handle out-of-order records correctly. I am not sure what the support for event-time is in Kinesis -- it seems their tumbling windows align to ROWTIME (is this wall-clock time?). However, using suppress(), you can get exactly one result per session (by trading off some processing latency). Check out this blog post for more details: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
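As a sketch of how suppress() slots into such a session aggregation (the 30-second grace period and topic names are illustrative choices, not recommendations): untilWindowCloses() requires a bounded grace period on the window definition, after which late records are dropped instead of updating the session.

```java
import java.time.Duration;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Suppressed;

public class SuppressedSessionExample {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("events")
            .groupByKey()
            // A bounded grace period is mandatory for untilWindowCloses():
            // records arriving later than this are dropped as too late.
            .windowedBy(SessionWindows.with(Duration.ofSeconds(10))
                                      .grace(Duration.ofSeconds(30)))
            .count()
            // Emit exactly one final result per session, once stream time
            // passes session end + grace (trading off some latency).
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream((windowedKey, count) -> windowedKey.key())
            .to("final-session-counts");

        return builder.build();
    }
}
```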
and also that sessions may not be emitted until future 'stream time' events are recorded?
That's correct. But this would only happen if no new data arrives at all, which should not be the case for a stream processing application with a continuous data flow.
What you could also do is implement the logic you want yourself, using transform() with a windowed state store. Using wall-clock-time punctuations, you can also ensure that data is emitted even if no new input data arrives. The most challenging part will be the handling of out-of-order records for this case.
Upvotes: 2