user1028741

Reputation: 2825

How to optimize aggregation so that aggregation per consumer is done first?

I have a kafka topic named input with multiple partitions.

Let's say a message looks like this:

{
    "key": 123456, 
    "otherKey": 444, 
    ... 
}

Records are partitioned by "key" (and so the same key will always end up processed by the same Kafka consumer).

Now I would like to count the number of events for each "otherKey" per minute. This, to my understanding, can easily be done using KStreams like this:

input.groupBy((k, v) -> v.getOtherKey())
     .windowedBy(TimeWindows.of(Duration.ofSeconds(60)))
     .count()
     .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
     .toStream()
     .to("output");

With groupBy, Kafka Streams repartitions the data through an internal Kafka topic, writing one record to that topic for every record in the input topic.

This seems wasteful to me. Each consumer could pre-count the messages for its own partitions per "otherKey" and publish to the internal topic only once per minute per "otherKey".

Is there a way to do this using Kafka Streams?

Upvotes: 3

Views: 142

Answers (1)

Matthias J. Sax

Reputation: 62350

Your observation about the behavior is correct and your idea to optimize the execution is also correct.

However, this optimization is currently not implemented. The reason is that suppress() is a fairly new operator, and the optimization you describe did not make sense before suppress() was introduced.

If you really want this optimization, though, you can build it yourself using the Processor API.
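
For illustration, here is a minimal sketch of such a pre-aggregation using a Transformer with a scheduled punctuation that flushes partial counts once per minute. The Message class, its getOtherKey() accessor, the store name "pre-agg-store", and messageSerde are assumptions based on the JSON in your question; the rest uses the regular Processor API hooks (transform(), schedule(), state stores).

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

// Assumed value type matching the JSON from the question.
class Message {
    int key;
    int otherKey;
    int getOtherKey() { return otherKey; }
}

// Counts records per "otherKey" locally, before any repartitioning happens.
class PreAggregator implements Transformer<Integer, Message, KeyValue<Integer, Long>> {
    private KeyValueStore<Integer, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<Integer, Long>) context.getStateStore("pre-agg-store");
        // Once per minute, forward all partial counts downstream and reset them.
        context.schedule(Duration.ofSeconds(60), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (final KeyValueIterator<Integer, Long> it = store.all()) {
                while (it.hasNext()) {
                    final KeyValue<Integer, Long> partial = it.next();
                    context.forward(partial.key, partial.value);
                    store.delete(partial.key);
                }
            }
        });
    }

    @Override
    public KeyValue<Integer, Long> transform(final Integer key, final Message value) {
        final Long count = store.get(value.getOtherKey());
        store.put(value.getOtherKey(), count == null ? 1L : count + 1L);
        return null; // emit nothing per record; the punctuator emits the partial counts
    }

    @Override
    public void close() {}
}

Wiring it up, the downstream aggregation sums the (much smaller number of) partial counts instead of counting raw records:

final StreamsBuilder builder = new StreamsBuilder();

builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("pre-agg-store"),
        Serdes.Integer(),
        Serdes.Long()));

builder.stream("input", Consumed.with(Serdes.Integer(), messageSerde)) // messageSerde: assumed serde for Message
       .transform(() -> new PreAggregator(), "pre-agg-store")
       .groupByKey(Grouped.with(Serdes.Integer(), Serdes.Long()))
       .windowedBy(TimeWindows.of(Duration.ofSeconds(60)))
       .reduce(Long::sum)
       .toStream()
       .to("output"); // writing the Windowed<Integer> key needs a windowed serde, omitted here

Note that the punctuation uses wall-clock time, so the forwarded partial counts carry the flush timestamp rather than the original event timestamps; a production version would bucket the store by (otherKey, window start) to preserve event-time semantics.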

Upvotes: 1
