Haifeng Zhang

Reputation: 31895

Kafka producer timeout exception

[1] 2022-01-18 21:56:10,280 ERROR [org.apa.cam.pro.err.DefaultErrorHandler] (Camel (camel-1) thread #9 - KafkaProducer[test]) Failed delivery for (MessageId: 95835510BC9E9B2-0000000000134315 on ExchangeId: 95835510BC9E9B2-0000000000134315). Exhausted after delivery attempt: 1 caught: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0:121924 ms has passed since batch creation
[1]
[1] Message History (complete message history is disabled)
[1] ---------------------------------------------------------------------------------------------------------------------------------------
[1] RouteId              ProcessorId          Processor                                                                        Elapsed (ms)
[1] [route1            ] [route1            ] [from[netty://udp://0.0.0.0:8080?receiveBufferSize=65536&sync=false]           ] [    125320]
[1]     ...
[1] [route1            ] [to1               ] [kafka:test?brokers=10.99.155.100:9092&producerBatchSize=0                     ] [         0]
[1]
[1] Stacktrace
[1] ---------------------------------------------------------------------------------------------------------------------------------------
[1] : org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0:121924 ms has passed since batch creation

Here's the flow for my project

  1. External Service ---> Netty
  2. Netty ---> Kafka (producer)
  3. Kafka (consumer) ---> processing events

1 and 2 are running in one Kubernetes pod and 3 is running in a separate pod.

I initially encountered a TimeoutException along the lines of:

org.apache.kafka.common.errors.TimeoutException: Expiring 20 record(s) for test-0:121924 ms has passed since batch creation

I searched online and found a couple of potential solutions, for example the question "Kafka Producer error Expiring 10 record(s) for TOPIC:XXXXXX: 6686 ms has passed since batch creation plus linger time".

Based on the suggestion, I have done:

  1. increased the timeout to double the default value
  2. set the batch size to 0, so that events are not sent in batches and memory usage stays low
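For reference, the two changes above can be sketched as plain Kafka producer properties (the values are assumptions; in the Camel route shown in the stack trace, `producerBatchSize=0` maps to the underlying `batch.size`):

```java
import java.util.Properties;

public class ProducerTuning {

    // Sketch of the two changes tried above (values are assumptions):
    // the Kafka default delivery.timeout.ms is 120000 ms, doubled here.
    public static Properties tunedConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.99.155.100:9092"); // broker from the route above
        props.put("delivery.timeout.ms", "240000"); // 1. double the default timeout
        props.put("batch.size", "0");               // 2. disable batching
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConfig().getProperty("delivery.timeout.ms")); // prints 240000
    }
}
```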

Unfortunately I still encounter the error, because memory gets used up.

Does anyone know how to solve it? Thanks!

Upvotes: 1

Views: 2524

Answers (1)

Marco

Reputation: 1232

There are several things to take into account here. You are not showing what your throughput is; you have to take that value into account, and whether your broker on 10.99.155.100:9092 is able to process such a load. Did you check 10.99.155.100 during the transfer? The fact that Kafka can potentially process hundreds of thousands of messages per second doesn't mean you can do it on any hardware.

So, having said that, the timeout is the first thing that comes to mind, but in your case you have 2 minutes and you are still timing out. To me, this sounds more like a problem on your broker than on your producer.

To understand the issue: basically, you are filling your mouth faster than you can swallow. By the time you push a message, the broker is not able to acknowledge it in time (in this case, within 2 minutes).

What things you can do here:

  • Check the broker performance for the given load.
  • Change your delivery.timeout.ms to an acceptable value; I guess you have SLAs to stick to.
  • Increase your retry backoff timer (retry.backoff.ms).
  • Do not set the batch size to 0; this attempts a live push to the broker, which in this case seems not possible for the load.
  • Make sure your max.block.ms is set correctly.
  • Change to bigger batches (even if this increases latency), but not too big; you need to sit down, check how many records you are pushing, and size the batches accordingly.

Now, some rules:

  • delivery.timeout.ms must be bigger than the sum of request.timeout.ms and linger.ms.
  • All of the above are impacted by batch.size.
  • If you don't have many rows, but those rows are huge, then control max.request.size.
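The first rule can be checked mechanically. A minimal sketch, using the Kafka defaults (delivery.timeout.ms=120000, request.timeout.ms=30000, linger.ms=0) as example inputs:

```java
public class TimeoutRule {

    // delivery.timeout.ms must be >= request.timeout.ms + linger.ms,
    // otherwise the producer rejects the configuration at startup.
    public static boolean isConsistent(long deliveryTimeoutMs,
                                       long requestTimeoutMs,
                                       long lingerMs) {
        return deliveryTimeoutMs >= requestTimeoutMs + lingerMs;
    }

    public static void main(String[] args) {
        // Kafka defaults satisfy the rule
        System.out.println(isConsistent(120000, 30000, 0)); // true
        // Shrinking delivery.timeout.ms below the sum breaks it
        System.out.println(isConsistent(20000, 30000, 0));  // false
    }
}
```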

So, to summarize, your properties to change are the following:

delivery.timeout.ms, request.timeout.ms, linger.ms, max.request.size

Assuming the hardware is good, and also assuming that you are not sending more than you should, those should do the trick.
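As a sketch, those four properties might look like this; all values here are illustrative assumptions showing the knobs, not recommendations for this particular workload:

```java
import java.util.Properties;

public class SummaryConfig {

    // Illustrative values only; tune them against your own load and SLAs.
    public static Properties summary() {
        Properties props = new Properties();
        props.put("delivery.timeout.ms", "240000"); // total time allowed to get the record delivered
        props.put("request.timeout.ms", "60000");   // per-request wait for a broker response
        props.put("linger.ms", "50");               // wait a little to fill bigger batches
        props.put("max.request.size", "2097152");   // cap for huge records (2 MiB here)
        return props;
    }

    public static void main(String[] args) {
        System.out.println(summary().getProperty("linger.ms")); // prints 50
    }
}
```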

Upvotes: 1
