Reputation: 1240
I tried a little experiment, and I'm wondering how to explain what I'm seeing. The purpose of the experiment was to try to understand how Kafka Streams is doing multithreading. I created and populated an input Topic with three partitions. Then I created a Streams graph that included the following, and configured it to run with three threads.
kstream = kstream.mapValues(tsdb_object -> {
System.out.println( "mapValues: Thread " + Thread.currentThread().getId());
return tsdb_object;
});
// Add operator to print results to stdout:
Printed<Long, TsdbObject> printed = Printed.toSysOut();
kstream.print(printed);
KGroupedStream<Long, TsdbObject> kstream_grouped_by_key = kstream.groupByKey(Serialized.with(Serdes.Long(), TsdbObject.getSerde()));
KTable<Long, TsdbObject> summation =
kstream_grouped_by_key.reduce((tsdb_object1, tsdb_object2) -> {
System.out.println("reducer: Thread " + Thread.currentThread().getId());
return tsdb_object1;
});
I figured that the first print statement would print out messages with three different thread id's, and that's what happened. However, I figured that the second print statement, being issued in the middle of an aggregation (reducer) operation, would print out messages listing only one thread id, under the assumption that the reduction would NOT be multithreaded. This turned out not to be true: the second print produced messages listing three different thread id's.
Can someone please explain briefly how the aggregation (reducer) is running in three different threads? Are they running in parallel?
Upvotes: 0
Views: 831
Reputation: 62330
Yes, the aggregation is execute with 3 threads as well, and each thread does the aggregation for about 1/3 of all keys.
Why would you assume that the aggregation is not multithreaded? Note, that it's an aggregation per key, thus the result for each key is independent of the result of all others keys. This allows to parallelize the computation.
Upvotes: 1