Reputation: 2386
We have multiple input topics with different business events (page views, clicks, scroll events etc). As far as I understood Kafka streams they all get an event timestamp, which can be used for KStream joins with other streams or tables to align the times.
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
This should by possible by using groupByKey
and then aggregate/reduce
(specifying the Inactivity time here) on a stream containing all events. This combined stream must have all events from the different input topics in an order of the event time (or in a way that the above kafka streams methods honor this event times).
The only challenge that is left, is to create this combined / merged stream.
When I look at the Kafka Streams API, there is the KStreamBuilder#merge
operation for which the javadoc says: There is no ordering guarantee for records from different {@link KStream}s.
. Does this mean the Session Windowing will produce incorrect results?
If yes, what is the alternative to #merge?
Upvotes: 3
Views: 6637
Reputation: 62350
I was also thinking about joining, but in fact it seems to depend if you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. For the first case, joining is a good strategy but not for the later, as you would get some unnecessary duplication.
stream A: <a,1> <a,2>
stream B: <a,3>
join-output plus session: <a,1-3 + 2-3>
Number 3
would be a duplicate.
Also keep in mind, that joining slightly modifies the time stamps and thus your session windows might be different if you apply them on the join result or on the raw data.
About merge()
and ordering. You can use merge()
safely as the session windows will be build based on record timestamp and not offset-order. And all window operations in Kafka Streams can handle out-of-order data gracefully.
Upvotes: 4
Reputation: 15087
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
From what I understand, you'd need to join the streams (and use groupBy
to ensure that they can be properly joined by user id), not merge them. You can then follow-up with an session-windowed aggregation.
Upvotes: 0