Reputation: 944
I would like to ask you if I need a replication factor in my KafkaStreamsConfiguration if I do not use stateful streaming. I don't use this RockDB. As far as I read replication factor setting is for changelog and repartition topics. I understand this changelog topic, but this repartition topic is a little confusing me... Can someone explain me in really basic words what this reparitions topic is and if I should care about this replication factor if I dont use state in my streaming app?
Regards
Upvotes: 1
Views: 2774
Reputation: 7563
In simple words, repartition occurs in Kafka Streams when you change the keys of the events/messages that you are processing.
Repartition is basically the shuffle phase on stream processing. This can happen in Kafka stream, Apache Spark, Flink, Storm, Hadoop, .... These are Distributed Stream Processing Engines (DSPE) which aims to execute tasks in parallel to speed-up the process. Then, when you call a map
transformation the DSPE transform this logical map
into a physical map
tasks of parallelism X (X is usually the number of the cores of the machine).
So, if you use mapValues
and does not change the keys, Kafka stream will not repartition. But if you change keys using map
, Kafka stream will repartition. Moreover, if you use any aggregate transformation (e.g.: reduce
, join
, ...) Kafka will execute repartition because it is based on keys.
The repartition/shuffle phase happens when there is an aggregation phase. Suppose you have a logical pipeline:
... -> map
-> reduce
-> ...
The physical pipeline under the hood will look like this:
Events with the same key are grouped by the groupByKey
transformation and send to the same reduce
parallel tasks instance. This is the shuffle phase.
In the case of Kafka stream when an aggregation happens the pipeline transforms from KStream
to KTable
because the messages are distributed on the Kafka brokers and the stream engine has to lookup the events on different partitions. If you use IntelliJ it inffers to you when the pipeline changed. In the figure below it is happening a word count and the count
transformation is stateful like the reduce
.
This is a good source to read more about repartition in Kafka Stream. Like I said, others DSPE also rely on repartition during the shuffle phase. Other nice source is this one from Flink.
Upvotes: 5