How does Kafka Streams handle distributed data?

Question

I have gone through various tutorials but am still not clear on two aspects of Kafka Streams. Let's take the word count example from: https://docs.confluent.io/current/streams/quickstart.html

// Serializers/deserializers (serde) for String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Construct a `KStream` from the input topic "streams-plaintext-input", where message values
// represent lines of text (for the sake of this example, we ignore whatever may be stored
// in the message keys).
KStream<String, String> textLines = builder.stream("streams-plaintext-input", Consumed.with(stringSerde, stringSerde));

KTable<String, Long> wordCounts = textLines
// Split each text line, by whitespace, into words.  The text lines are the message
// values, i.e. we can ignore whatever data is in the message keys and thus invoke
// `flatMapValues` instead of the more generic `flatMap`.
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
// We use `groupBy` to ensure the words are available as message keys
.groupBy((key, value) -> value)
// Count the occurrences of each word (message key).
.count();

// Convert the `KTable` into a `KStream` and write to the output topic.
wordCounts.toStream().to("streams-wordcount-output", 
Produced.with(stringSerde, longSerde));

A couple of questions:
1.) Since the original stream has no keys, occurrences of the same word can end up on different nodes because they may fall into different partitions, so the true count would be the aggregate across all of them. That does not seem to be done here. Do the nodes serving different partitions of the same topic coordinate to aggregate the count?
2.) As each operation (e.g. flatMapValues, groupBy) generates a new stream, are the partitions recalculated for the messages in these substreams so that they land on different nodes?
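
To make the two questions concrete, here is a minimal plain-Java sketch of what I mean, with no Kafka dependency. Everything in it is hypothetical: the class `RepartitionSketch`, the round-robin placement of keyless lines, and the `String.hashCode`-based `partitionFor` are stand-ins I made up for illustration, not Kafka's actual partitioner. The first method models my concern in 1.) (each partition only sees part of the data), the second models what I suspect happens in 2.). My understanding, which I would like confirmed, is that a key-changing `groupBy` makes Kafka Streams write the records through an internal repartition topic keyed by the new key before counting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RepartitionSketch {

    // Hypothetical partitioner: hash of the key modulo the partition count.
    // (An assumption for illustration; Kafka's default partitioner hashes differently.)
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    private static List<Map<String, Long>> emptyCounts(int numPartitions) {
        List<Map<String, Long>> counts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            counts.add(new HashMap<>());
        }
        return counts;
    }

    // Concern from 1.): keyless lines are spread over partitions (round-robin here)
    // and every partition counts only the words it happens to see.
    public static List<Map<String, Long>> countPerInputPartition(List<String> lines, int numPartitions) {
        List<Map<String, Long>> counts = emptyCounts(numPartitions);
        for (int i = 0; i < lines.size(); i++) {
            Map<String, Long> local = counts.get(i % numPartitions);
            for (String word : lines.get(i).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    local.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Suspicion from 2.): after re-keying by word, each record is routed by the
    // hash of its new key, so all occurrences of a word reach the same partition.
    public static List<Map<String, Long>> countAfterRepartition(List<String> lines, int numPartitions) {
        List<Map<String, Long>> counts = emptyCounts(numPartitions);
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.get(partitionFor(word, numPartitions)).merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("kafka streams", "kafka connect", "hello kafka");
        System.out.println("per input partition: " + countPerInputPartition(lines, 2));
        System.out.println("after re-keying:     " + countAfterRepartition(lines, 2));
    }
}
```

In the first model the count for "kafka" is split across the two partitions; in the second every word is counted in exactly one place, so no coordination between nodes would be needed.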

I would appreciate any help here!
