Reputation: 9771
I have an intermittent issue in my logs.
It seems the heartbeat thread is constantly struggling and gets:
Error sending fetch request
org.apache.kafka.common.errors.DisconnectException
Group coordinator xxx is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: true. Rediscovery will be attempted.
I already made heartbeat.interval.ms a bit larger, but it is still happening.
I would like to understand what this can cascade into in a Kafka Streams application configured for static membership. In particular, can it lead to a rebalance?
I have also increased request.timeout.ms. Should I also increase delivery.timeout.ms, and if so, why?
I am trying to understand how things cascade. That is, what happens if the heartbeat thread keeps trying and then reaches delivery.timeout.ms? Is that what causes "is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: true"? What controls the subsequent retry after that? Indeed, the consumer is not failing; it recovers, as it ultimately rediscovers the coordinator.
I am a little confused about how this can cascade, so I am looking for a way to explain it and get a handle on it.
Any ideas on how to help with this?
09:48:12.357 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] Processed 70000 total records, ran 0 punctuators, and committed 3 total tasks since the last update
09:48:21.125 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1490129196, epoch=INITIAL) to node 0:
org.apache.kafka.common.errors.DisconnectException: null
09:48:21.125 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1666554310, epoch=INITIAL) to node 4:
org.apache.kafka.common.errors.DisconnectException: null
09:48:21.225 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=522549112, epoch=INITIAL) to node 5:
org.apache.kafka.common.errors.DisconnectException: null
09:48:42.711 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-2, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=810037183, epoch=INITIAL) to node 2:
org.apache.kafka.common.errors.DisconnectException: null
09:49:23.172 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-2, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=810037183, epoch=INITIAL) to node 2:
org.apache.kafka.common.errors.DisconnectException: null
09:50:17.295 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] Processed 42290 total records, ran 0 punctuators, and committed 3 total tasks since the last update
09:52:28.299 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] Processed 60000 total records, ran 0 punctuators, and committed 1 total tasks since the last update
09:54:35.743 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] Processed 60000 total records, ran 0 punctuators, and committed 1 total tasks since the last update
09:55:07.269 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Group coordinator sdc-oxygen-dev-cp-kafka-4.sdc-oxygen-dev-cp-kafka-headless.sdc-oxygen-dev:9092 (id: 2147483643 rack: null) is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: true. Rediscovery will be attempted.
09:55:07.371 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Discovered group coordinator sdc-oxygen-dev-cp-kafka-4.sdc-oxygen-dev-cp-kafka-headless.sdc-oxygen-dev:9092 (id: 2147483643 rack: null)
09:55:07.473 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Discovered group coordinator sdc-oxygen-dev-cp-kafka-4.sdc-oxygen-dev-cp-kafka-headless.sdc-oxygen-dev:9092 (id: 2147483643 rack: null)
09:55:50.644 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1] Processed 30000 total records, ran 0 punctuators, and committed 1 total tasks since the last update
09:56:20.888 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1490129196, epoch=INITIAL) to node 0:
org.apache.kafka.common.errors.DisconnectException: null
09:56:20.888 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1666554310, epoch=INITIAL) to node 4:
org.apache.kafka.common.errors.DisconnectException: null
09:56:20.989 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=522549112, epoch=INITIAL) to node 5:
org.apache.kafka.common.errors.DisconnectException: null
09:56:47.135 [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-2] Processed 70000 total records, ran 0 punctuators, and committed 1 total tasks since the last update
09:57:04.873 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1490129196, epoch=INITIAL) to node 0:
org.apache.kafka.common.errors.DisconnectException: null
09:57:04.873 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=1666554310, epoch=INITIAL) to node 4:
org.apache.kafka.common.errors.DisconnectException: null
09:57:04.973 [kafka-coordinator-heartbeat-thread | _entellect-cbe-builder-resnet-0] INFO o.a.k.clients.FetchSessionHandler - [Consumer instanceId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-1, clientId=sdc-oxygen-dev-entellect-cbe-builder-resnet-22-StreamThread-1-consumer, groupId=_entellect-cbe-builder-resnet-0] Error sending fetch request (sessionId=522549112, epoch=INITIAL) to node 5:
org.apache.kafka.common.errors.DisconnectException: null
Upvotes: 0
Views: 1621
Reputation: 4798
I already made heartbeat.interval.ms a bit larger, but it is still happening
Increasing it does not prevent the exception. The heartbeat is a fault-detection mechanism: it exists to detect abnormal situations. With a larger interval an abnormal condition is detected later, and with a smaller one it is detected sooner. Either way, changing the interval makes no difference to the disconnects themselves.
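To make the fault-detection point concrete, here is a minimal sketch of how the heartbeat interval relates to the session timeout for a static member. It relies on Kafka's documented guidance that heartbeat.interval.ms must be lower than session.timeout.ms and is typically no more than one third of it; the session timeout, not the heartbeat interval, bounds how long a silent member stays in the group. The class name, method names, instance id, and all values are hypothetical and only for illustration.

```java
import java.util.Properties;

public class HeartbeatSettingsSketch {

    // Documented guidance: the heartbeat interval must be below the session
    // timeout and is typically no higher than one third of it.
    static boolean withinRecommendedRatio(long heartbeatIntervalMs, long sessionTimeoutMs) {
        return heartbeatIntervalMs > 0 && heartbeatIntervalMs * 3 <= sessionTimeoutMs;
    }

    // Builds consumer-level settings for a static member.
    // The values passed in are illustrative, not recommendations.
    static Properties staticMemberSettings(String instanceId,
                                           long heartbeatIntervalMs,
                                           long sessionTimeoutMs) {
        if (!withinRecommendedRatio(heartbeatIntervalMs, sessionTimeoutMs)) {
            throw new IllegalArgumentException(
                "heartbeat.interval.ms should be at most one third of session.timeout.ms");
        }
        Properties props = new Properties();
        // group.instance.id is what turns a consumer into a static member.
        props.setProperty("group.instance.id", instanceId);
        props.setProperty("heartbeat.interval.ms", Long.toString(heartbeatIntervalMs));
        props.setProperty("session.timeout.ms", Long.toString(sessionTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical instance id and timings.
        Properties props = staticMemberSettings("example-instance-1", 3_000, 45_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Raising only the heartbeat interval moves it closer to the session timeout, which leaves fewer heartbeats per session window rather than more tolerance for disconnects.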
I have also increased request.timeout.ms. Should I also increase delivery.timeout.ms, and if so, why?
I believe the two are independent of each other. For example, if a request is sent to the endpoint and no response comes back, that does not mean the endpoint is down. In practice this happens when the endpoint is overwhelmed by load and cannot respond promptly. So request.timeout.ms should be set to a reasonable value, whereas delivery.timeout.ms should be set to a large one. As the official documentation says about delivery.timeout.ms, which is a producer setting:
which sets an upper bound on the total time between sending a record and receiving acknowledgement from the broker. By default, the delivery timeout is set to 2 minutes.
As I mentioned.
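As a rough illustration of how the two timeouts sit together on the producer side, the sketch below checks the documented constraint that delivery.timeout.ms should be greater than or equal to the sum of request.timeout.ms and linger.ms. The class and method names are hypothetical, and the numbers are examples only (the 2-minute delivery timeout matches the default quoted above).

```java
import java.util.Properties;

public class ProducerTimeoutSketch {

    // Documented producer constraint:
    // delivery.timeout.ms >= request.timeout.ms + linger.ms
    static boolean deliveryTimeoutIsValid(long deliveryTimeoutMs,
                                          long requestTimeoutMs,
                                          long lingerMs) {
        return deliveryTimeoutMs >= requestTimeoutMs + lingerMs;
    }

    // Builds producer-level timeout settings; values are illustrative.
    static Properties producerTimeouts(long deliveryTimeoutMs,
                                       long requestTimeoutMs,
                                       long lingerMs) {
        if (!deliveryTimeoutIsValid(deliveryTimeoutMs, requestTimeoutMs, lingerMs)) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms must cover request.timeout.ms + linger.ms");
        }
        Properties props = new Properties();
        // Upper bound on the total time to report success or failure for a record.
        props.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs));
        // How long the client waits for the response to a single request.
        props.setProperty("request.timeout.ms", Long.toString(requestTimeoutMs));
        props.setProperty("linger.ms", Long.toString(lingerMs));
        return props;
    }

    public static void main(String[] args) {
        // 2 minutes overall, 30 seconds per request, 100 ms of batching delay.
        Properties props = producerTimeouts(120_000, 30_000, 100);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

If request.timeout.ms is raised, this check is the main reason delivery.timeout.ms may need to rise with it: the overall bound has to stay at least as large as a single request attempt plus the linger delay.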
I am trying to understand how things cascade. That is, what happens if the heartbeat thread keeps trying and then reaches delivery.timeout.ms? Is that what causes "is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: true"?
Yes. After a timeout of either kind, request or delivery, the heartbeat thread signals to a master node that a member of the group is suspected of having failed.
What controls the subsequent retry after that?
The Admin client, which manages and inspects topics, brokers, ACLs, and other Kafka objects.
Upvotes: 1