Ernesto

Reputation: 944

Changelog topics and repartition topics in Kafka Streams

I would like to ask whether I need to set a replication factor in my KafkaStreamsConfiguration if I do not use stateful streaming. I don't use RocksDB. As far as I have read, the replication factor setting applies to changelog and repartition topics. I understand the changelog topic, but the repartition topic confuses me a little. Can someone explain in really basic words what a repartition topic is, and whether I should care about the replication factor if I don't use state in my streaming app?

Regards

Upvotes: 1

Views: 2774

Answers (1)

Felipe

Reputation: 7563

In simple words, repartition occurs in Kafka Streams when you change the keys of the events/messages that you are processing.

Repartitioning is basically the shuffle phase of stream processing. It can happen in Kafka Streams, Apache Spark, Flink, Storm, Hadoop, and so on. These are Distributed Stream Processing Engines (DSPEs), which aim to execute tasks in parallel to speed up processing. When you call a map transformation, the DSPE turns this logical map into physical map tasks with parallelism X (X is often the number of cores of the machine; in Kafka Streams it is bounded by the number of input topic partitions).

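So, if you use mapValues and do not change the keys, Kafka Streams will not repartition. But if you change the keys using map, Kafka Streams marks the stream as needing repartitioning. The repartition itself is executed when a key-based operation follows (e.g. an aggregation such as reduce, or a join): the records are written to an internal repartition topic and read back, so that all records with the same new key land in the same partition.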
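
To make the difference between mapValues and a key-changing map concrete, here is a minimal sketch in plain Java. It is a simulation, not the Kafka Streams API: the `Event` class, `partitionFor`, and every other name are hypothetical, and `String.hashCode` stands in for Kafka's real partitioner (the default one hashes the serialized key with murmur2). It only illustrates why records can end up in the "wrong" partition after the key changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java simulation of partition placement; all names are hypothetical.
final class RepartitionSketch {

    static final class Event {
        final String key;
        final String value;
        Event(String key, String value) { this.key = key; this.value = value; }
    }

    // Simplified partitioner (assumption: String.hashCode replaces Kafka's real hash).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Place every event in the partition chosen by its key.
    static List<List<Event>> partitionByKey(List<Event> events, int numPartitions) {
        List<List<Event>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (Event e : events) partitions.get(partitionFor(e.key, numPartitions)).add(e);
        return partitions;
    }

    // Like mapValues: values change, keys and placement stay the same.
    static List<List<Event>> upperCaseValues(List<List<Event>> partitions) {
        List<List<Event>> out = new ArrayList<>();
        for (List<Event> p : partitions) {
            List<Event> mapped = new ArrayList<>();
            for (Event e : p) mapped.add(new Event(e.key, e.value.toUpperCase(Locale.ROOT)));
            out.add(mapped);
        }
        return out;
    }

    // Like map with a new key: each record stays in place but now carries its value as key.
    static List<List<Event>> reKeyByValue(List<List<Event>> partitions) {
        List<List<Event>> out = new ArrayList<>();
        for (List<Event> p : partitions) {
            List<Event> mapped = new ArrayList<>();
            for (Event e : p) mapped.add(new Event(e.value, e.value));
            out.add(mapped);
        }
        return out;
    }

    // True if every record sits in the partition that its current key hashes to.
    static boolean keysMatchPartitions(List<List<Event>> partitions) {
        for (int p = 0; p < partitions.size(); p++) {
            for (Event e : partitions.get(p)) {
                if (partitionFor(e.key, partitions.size()) != p) return false;
            }
        }
        return true;
    }

    // Repartition: collect all records and redistribute them by their current key.
    static List<List<Event>> repartition(List<List<Event>> partitions) {
        List<Event> all = new ArrayList<>();
        for (List<Event> p : partitions) all.addAll(p);
        return partitionByKey(all, partitions.size());
    }

    public static void main(String[] args) {
        List<Event> input = List.of(
                new Event("u1", "kafka"), new Event("u2", "streams"),
                new Event("u1", "streams"), new Event("u2", "kafka"));
        List<List<Event>> byUser = partitionByKey(input, 2);
        List<List<Event>> byWord = reKeyByValue(byUser);
        System.out.println("after mapValues:   " + keysMatchPartitions(upperCaseValues(byUser)));
        System.out.println("after map:         " + keysMatchPartitions(byWord));
        System.out.println("after repartition: " + keysMatchPartitions(repartition(byWord)));
    }
}
```

In the real library the redistribution step goes through an internal repartition topic on the brokers rather than an in-memory list, which is why that topic has a replication factor at all.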

The repartition/shuffle phase happens when there is an aggregation phase. Suppose you have a logical pipeline:

... -> map -> reduce -> ...

The physical pipeline under the hood will look like this:

[Figure: physical execution plan of the map -> reduce pipeline]

Events with the same key are grouped by the groupByKey transformation and sent to the same parallel instance of the reduce task. This is the shuffle phase.
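
The shuffle can be sketched in a few lines of plain Java. Again this is only a simulation under stated assumptions: `ShuffleSketch`, `taskFor`, and `shuffleAndCount` are hypothetical names, the parallel tasks are modelled as in-memory lists and maps, and `String.hashCode` replaces whatever hash a real engine uses. The point is that every occurrence of a word is routed to the one reduce task that owns that key, so each task can count locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of shuffle + reduce; all names are hypothetical.
final class ShuffleSketch {

    // Pick the reduce task that owns a key (assumption: simple hash modulo).
    static int taskFor(String key, int numReduceTasks) {
        return Math.floorMod(key.hashCode(), numReduceTasks);
    }

    // Route each word emitted by the map tasks to its reduce task and count it there.
    static List<Map<String, Integer>> shuffleAndCount(List<List<String>> mapTaskOutputs,
                                                      int numReduceTasks) {
        List<Map<String, Integer>> reduceTasks = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) reduceTasks.add(new TreeMap<>());
        for (List<String> output : mapTaskOutputs) {
            for (String word : output) {
                // Shuffle: same key always goes to the same reduce task.
                Map<String, Integer> owner = reduceTasks.get(taskFor(word, numReduceTasks));
                // Reduce: the owning task keeps a running count per key.
                owner.merge(word, 1, Integer::sum);
            }
        }
        return reduceTasks;
    }

    public static void main(String[] args) {
        // Two parallel map tasks, each emitting words as keys.
        List<List<String>> mapOutputs = List.of(
                List.of("kafka", "streams", "kafka"),
                List.of("streams", "flink", "kafka"));
        List<Map<String, Integer>> result = shuffleAndCount(mapOutputs, 2);
        for (int t = 0; t < result.size(); t++) {
            System.out.println("reduce task " + t + ": " + result.get(t));
        }
    }
}
```

Note that "kafka" is emitted by both map tasks but is counted in a single reduce task; without the shuffle each task would only see a partial count.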

In the case of Kafka Streams, when an aggregation happens the pipeline changes from a KStream to a KTable: the aggregation produces a continuously updated table with the latest result per key, and to compute it the stream engine first has to bring together events with the same key, which may be spread over different partitions on the Kafka brokers. If you use IntelliJ, its inferred type hints show you where the pipeline changes. In the figure below a word count is being computed, and the count transformation is stateful, like reduce.

[Figure: IntelliJ type hints on a word-count pipeline, showing the change from KStream to KTable at count]

This is a good source for reading more about repartitioning in Kafka Streams. Like I said, other DSPEs also rely on repartitioning during the shuffle phase. Another nice source is this one from Flink.

Upvotes: 5
