Reputation: 43
I have a Spark Streaming (Scala) application running on CDH 5.13 that consumes messages from Kafka using the 0.10.0 client. My Kafka cluster contains 3 brokers. The Kafka topic is divided into 12 partitions, evenly distributed across these 3 brokers. My Spark Streaming consumer has 12 executors with 1 core each. Spark Streaming starts by reading millions of messages from Kafka in each batch, but then reduces the batch size to thousands because Spark cannot cope with the load and a queue of unprocessed batches builds up. That is fine, and my expectation is that Spark processes the small batches very quickly and returns to normal. However, I see that from time to time one of the executors, which processes only a few hundred messages, gets a 'request timeout' error just after reading the last offset from Kafka:
DEBUG org.apache.kafka.clients.NetworkClient Disconnecting from node 12345 due to request timeout
After this error, the executor sends several RPC requests to the driver that take ~40 seconds, and after this time the executor reconnects to the same broker from which it disconnected.
My question is: how can I prevent this request timeout, and what is the best way to find its root cause?
Thank you
Upvotes: 2
Views: 14847
Reputation: 43
The root cause of the disconnection was that the response to the data request arrived from Kafka too late, i.e. after the request.timeout.ms parameter, which was set to the default of 40000 ms. The disconnection problem was fixed when I increased this value.
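For reference, this is roughly how the timeout can be raised in the consumer parameters passed to the 0.10 direct stream; a minimal sketch, where the broker addresses, group id, topic name and the 120000 ms value are placeholders rather than my actual settings:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.KafkaUtils

    object KafkaTimeoutExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-request-timeout-example")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Consumer properties handed to the Kafka 0.10 direct stream.
        // Brokers, group id and topic are placeholders.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092,broker2:9092,broker3:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "my-consumer-group",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean),
          // Raise the client request timeout above the 40000 ms default so a slow
          // broker response no longer triggers "Disconnecting ... due to request timeout".
          // For the 0.10 consumer it must stay larger than session.timeout.ms.
          "request.timeout.ms" -> (120000: java.lang.Integer)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](Seq("my-topic"), kafkaParams)
        )

        // Placeholder processing: just count the records in each micro-batch.
        stream.foreachRDD(rdd => println(s"Records in batch: ${rdd.count()}"))

        ssc.start()
        ssc.awaitTermination()
      }
    }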
Upvotes: 1