How does Kafka Streams multithread/parallelize aggregation operations?

Question

I tried a little experiment, and I'm wondering how to explain what I'm seeing. The purpose of the experiment was to try to understand how Kafka Streams is doing multithreading. I created and populated an input Topic with three partitions. Then I created a Streams graph that included the following, and configured it to run with three threads.

kstream = kstream.mapValues(tsdb_object -> {
    System.out.println( "mapValues: Thread " + Thread.currentThread().getId());
    return tsdb_object;
});

// Add operator to print results to stdout:
Printed printed = Printed.toSysOut();
kstream.print(printed);

KGroupedStream kstream_grouped_by_key = kstream.groupByKey(Serialized.with(Serdes.Long(), TsdbObject.getSerde()));
KTable summation =
    kstream_grouped_by_key.reduce((tsdb_object1, tsdb_object2) -> {
        System.out.println("reducer: Thread " + Thread.currentThread().getId());
        return tsdb_object1;
    });

I figured that the first print statement would print out messages with three different thread id's, and that's what happened. However, I figured that the second print statement, being issued in the middle of an aggregation (reducer) operation, would print out messages listing only one thread id, under the assumption that the reduction would NOT be multithreaded. This turned out not to be true: the second print produced messages listing three different thread id's.

Can someone please explain briefly how the aggregation (reducer) is running in three different threads? Are they running in parallel?

How does Kafka Streams multithread/parallelize aggregation operations?

Answers (1)

Related Questions