I have a Kafka Streams application that reads data from a few topics, joins the data, and writes the result to another topic.
Kafka configuration:
5 Kafka brokers
Kafka topics: 15 partitions, replication factor 3.
Note: I am running the Kafka Streams applications on the same machines where my Kafka brokers are running.
A few million records are consumed/produced every hour. Whenever I take any Kafka broker down, the application goes into rebalancing, and the rebalance takes approximately 30 minutes, sometimes even more.
Does anyone have an idea how to solve this rebalancing issue in the Kafka consumer? Also, it often throws an exception while rebalancing.
This is stopping us from going live in the production environment with this setup. Any help would be appreciated.
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: ?
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(
at org.apache.kafka.streams.processor.internals.StreamTask.commitOffsets(
at org.apache.kafka.streams.processor.internals.StreamTask.access$000(
at org.apache.kafka.streams.processor.internals.StreamTask$
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(
at org.apache.kafka.streams.processor.internals.StreamTask.commitImpl(
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(
at org.apache.kafka.streams.processor.internals.StreamThread$3.apply(
at org.apache.kafka.streams.processor.internals.StreamThread.performOnStreamTasks(
at org.apache.kafka.streams.processor.internals.StreamThread.suspendTasksAndState(
Kafka Streams Config:
max.poll.records = 100
ConsumerConfig it internally creates is:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [kafka-1:9092, kafka-2:9092, kafka-3:9092, kafka-4:9092, kafka-5:9092]
check.crcs = true
client.id = conversion-live-StreamThread-1-restore-consumer
connections.max.idle.ms = 540000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id =
heartbeat.interval.ms = 3000
interceptor.classes = null
internal.leave.group.on.close = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 2147483647
max.poll.records = 100
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 40000
retry.backoff.ms = 100
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
In my experience, first, your max.poll.records is too small for your workload (a few million records consumed/produced every hour).
If max.poll.records is too small, say 1, then rebalancing takes very long. I don't know the reason.
Second, make sure the input topics of your Streams app all have the same number of partitions. For example, say APP-1 has two input topics, A and B. If A has 4 partitions and B has 2, then rebalancing takes very long. However, if A and B both have 4 partitions, even if some partitions are idle, then the rebalancing time is good. Hope it helps.
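The two suggestions above could be sketched roughly as follows. This is only an illustration in plain Java: the class name RebalanceTuningSketch, both method names, and the value 500 are hypothetical, and the partition counts are assumed to be supplied by the caller (how you look them up is outside this sketch). Only the property keys themselves are real Kafka settings.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper class; not part of Kafka.
final class RebalanceTuningSketch {

    // Builds the Properties that would be passed to Kafka Streams.
    // The keys are real config names; the values are whatever the caller picks.
    static Properties streamsProps(String applicationId, String bootstrapServers, int maxPollRecords) {
        if (maxPollRecords < 1) {
            throw new IllegalArgumentException("max.poll.records must be at least 1");
        }
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Consumer setting from the question, raised from 100 (assumed value chosen by the caller).
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }

    // Returns true if every input topic has the same partition count.
    static boolean samePartitionCount(Map<String, Integer> partitionsByTopic) {
        return new HashSet<>(partitionsByTopic.values()).size() <= 1;
    }

    public static void main(String[] args) {
        Properties props = streamsProps("my-app", "kafka-1:9092", 500);
        System.out.println(props.getProperty("max.poll.records"));

        // Topic names and counts mirror the A/B example in the text.
        Map<String, Integer> partitions = new LinkedHashMap<>();
        partitions.put("A", 4);
        partitions.put("B", 2);
        System.out.println(samePartitionCount(partitions));
    }
}
```

Whether a larger max.poll.records actually shortens the rebalance in a given setup would still need to be measured; the sketch only shows where the setting goes.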
I would recommend configuring StandbyTasks via the parameter num.standby.replicas=1 (default is 0). This should help to reduce the rebalance time significantly.
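As a minimal sketch of that setting, the snippet below adds num.standby.replicas to an existing set of Streams properties. The class name StandbyConfigSketch and the method withStandbyReplicas are hypothetical; the sketch uses the plain string key so it runs without the Kafka jars (in a real application the same key is available as the constant StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG).

```java
import java.util.Properties;

// Hypothetical helper class; not part of Kafka.
final class StandbyConfigSketch {

    // Returns a copy of the given properties with num.standby.replicas set.
    static Properties withStandbyReplicas(Properties base, int replicas) {
        if (replicas < 0) {
            throw new IllegalArgumentException("num.standby.replicas must not be negative");
        }
        Properties props = new Properties();
        props.putAll(base);
        // Default is 0, i.e. no standby copies of the state stores.
        props.setProperty("num.standby.replicas", Integer.toString(replicas));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("max.poll.records", "100"); // value from the question
        Properties props = withStandbyReplicas(base, 1);
        System.out.println(props.getProperty("num.standby.replicas"));
    }
}
```

The resulting Properties object would be passed to the KafkaStreams constructor as usual. Note that a standby task is placed on a different application instance than the active task, so this presumably only helps when more than one instance of the application is running.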
Furthermore, I would recommend upgrading your application to Kafka 0.11. Note that Streams API 0.11 is backward compatible with 0.10.1 and 0.10.2 brokers, so you don't need to upgrade your brokers for this. Rebalance behavior was heavily improved in 0.11 and will be further improved in the upcoming 1.0 release; thus, upgrading your application to the latest version is always an improvement for rebalancing.
