Chris W.

Reputation: 2386

How to merge multiple Kafka streams in order to do session windowing over all events of the resulting stream

We have multiple input topics with different business events (page views, clicks, scroll events, etc.). As far as I understand Kafka Streams, all records get an event timestamp, which can be used for KStream joins with other streams or tables to align them in time.

What we want to do is: merge all the different events (originating from the different topics mentioned above) for a user id (i.e. group by user id) and apply a session window to them.

This should be possible by using groupByKey and then aggregate/reduce (specifying the inactivity gap there) on a stream containing all events. This combined stream must contain all events from the different input topics ordered by event time (or at least in a way that the Kafka Streams methods above honor these event timestamps).
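Roughly, I imagine the windowing step to look something like this (a sketch assuming a newer Kafka Streams version; allEvents is the merged stream I still need to build, keyed by user id, with the event payload as a String, and the 30-minute inactivity gap is just an example value):

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.*;

// "allEvents" stands for the merged stream of all business events, keyed by user id
KTable<Windowed<String>, String> sessions = allEvents
        .groupByKey()                                              // group by user id
        .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))   // session = 30 min inactivity gap
        .reduce((aggregate, event) -> aggregate + "," + event);    // collapse a session into one value
```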

The only remaining challenge is to create this combined/merged stream.

When I look at the Kafka Streams API, there is the KStreamBuilder#merge operation, whose Javadoc says: "There is no ordering guarantee for records from different {@link KStream}s." Does this mean the session windowing will produce incorrect results?

If yes, what is the alternative to #merge?

Upvotes: 3

Views: 6637

Answers (2)

Matthias J. Sax

Reputation: 62350

I was also thinking about joining, but in fact it seems to depend on whether you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. For the first case, joining is a good strategy, but not for the latter, as you would get some unnecessary duplication:

```
stream A: <a,1> <a,2>
stream B: <a,3>
join output plus session: <a,1-3 + 2-3>
```

Event 3 would be duplicated.

Also keep in mind that joining slightly modifies the timestamps, so your session windows might differ depending on whether you apply them to the join result or to the raw data.

About merge() and ordering: you can use merge() safely, as the session windows will be built based on record timestamps and not on offset order. All window operations in Kafka Streams handle out-of-order data gracefully.
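For example, a minimal sketch using the newer StreamsBuilder API, where merge() is an instance method on KStream (the topic names here are made up):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// one stream per input topic, all keyed by user id (default serdes assumed)
KStream<String, String> pageViews = builder.stream("page-views");
KStream<String, String> clicks    = builder.stream("clicks");
KStream<String, String> scrolls   = builder.stream("scroll-events");

// merge() only interleaves the streams; the session windows applied downstream
// are computed from the record timestamps, not from the interleaving order
KStream<String, String> allEvents = pageViews.merge(clicks).merge(scrolls);
```

The resulting allEvents stream can then be grouped by key and session-windowed exactly as described in the question.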

Upvotes: 4

miguno

Reputation: 15087

What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.

From what I understand, you'd need to join the streams (and use groupBy to ensure that they can be properly joined by user id), not merge them. You can then follow up with a session-windowed aggregation.
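A rough sketch of what that could look like for two of the topics (hypothetical topic names; note that a stream-stream join is itself windowed, so a join window has to be chosen in addition to the later session window):

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> pageViews = builder.stream("page-views");
KStream<String, String> clicks    = builder.stream("clicks");

// join page views and clicks of the same user id that occur within 5 minutes of each other
KStream<String, String> joined = pageViews.join(
        clicks,
        (view, click) -> view + "|" + click,     // combine the two event payloads
        JoinWindows.of(Duration.ofMinutes(5)));
```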

Upvotes: 0
