Kafka Streams with EXACTLY_ONCE_V2: InvalidProducerEpochException: Producer attempted to produce with an old epoch

Question

This is a simple, plain Kafka Streams application doing a simple record transformation configured using EXACTLY_ONCE_V2.

configurationParameters.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

The error I see is as follows. If I remove the EXACTLY_ONCE_V2 setting mentioned above, this error vanishes, and the streams application runs for days without errors. The error log is:

[ERROR] 2021-11-27 18:10:23.141 [kafka-producer-network-thread | id-mapping-app-1eede139-ace6-4aff-9e94-ca508cb9c98d-StreamThread-1-producer] RecordCollectorImpl - stream-thread [id-mapping-app-1eede139-ace6-4aff-9e94-ca508cb9c98d-StreamThread-1] task [0_11] Error encountered sending record to topic data-records-output for task 0_11 due to: org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch. Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

The application is using Kafka Streams 3.0.0, which is the latest as of this writing. Gradle style Maven coordinates are: org.apache.kafka:kafka-streams:3.0.0

The Kafka brokers are running Kafka 2.8.0 under Strimzi 0.23.0 on Kubernetes.

I also experienced the exact same error message with a similar application using the Flink framework instead of Kafka Streams:

Flink 1.13.2: `ProducerFencedException: Producer attempted an operation with an old epoch`

In both Flink and Kafka Streams, disabling exactly once makes the error go away. If I turn on exactly once processing the error occurs.

Aivaras · Accepted Answer

What is the information in broker logs? In my case the reason was:

INFO [TransactionCoordinator id=3] Completed rollback of ongoing transaction for transactionalId stats-mapper-ac83167f-5e02-4fb1-92cd-cec0e6c7332f-2 due to timeout (kafka.coordinator.transaction.TransactionCoordinator)
ERROR [ReplicaManager broker=3] Error processing append operation on partition stats.topic-1 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.InvalidProducerEpochException: Epoch of producer 1015 at offset 2395825914 in stats.topic-1 is 0, which is smaller than the last seen epoch 1

and it looks like it is a work in progress KIP-588 which is expected to be fixed in 3.2.0 KAFKA-9803.

KIP states that

This could be caused by a short period of inactivity of the client due to network partition or long GC.

While I am unsure what the "network partition" means - the GC is quite trivial to monitor and see.

Kafka Streams with EXACTLY_ONCE_V2: InvalidProducerEpochException: Producer attempted to produce with an old epoch

Answers (1)

Related Questions