Sandeep Kanabar

Reputation: 1302

Recovering Kafka Cluster from a disk full error

We have a 3-node Kafka cluster. For data storage, each of the 3 nodes has 2 mounted disks, /data/disk1 and /data/disk2. The log.dirs setting in kafka.properties is:

log.dirs=/data/disk1/kafka-logs,/data/disk2/kafka-logs

It so happened that on one of the nodes, Node1, the disk partition holding /data/disk2/kafka-logs became 100% full.

The reason this happened is that we were replaying data from Filebeat to a Kafka topic, and a lot of data got pushed in a very short time. I've temporarily changed the retention for that topic from 7 days to 1 day, so the topic size is back to normal.
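For reference, the retention change was along these lines (an illustrative command, not the exact one used; the topic name my-topic and the ZooKeeper address zk01:2181 are placeholders, and newer Kafka versions take --bootstrap-server instead of --zookeeper):

    # reduce the topic's retention to 1 day (86400000 ms)
    kafka-configs.sh --zookeeper zk01:2181 --alter \
      --entity-type topics --entity-name my-topic \
      --add-config retention.ms=86400000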

The problem is that on Node1, where /data/disk2/kafka-logs is 100% full, the Kafka process just won't start and emits the following error:

Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,093] INFO Recovering unflushed segment 0 in log my-topic-0. (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,094] INFO Completed load of log my-topic-0 with 1 log segments and log end offset 0 in 2 ms (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,095] ERROR There was an error in one of the threads during logs loading: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code (kafka.log.LogManager)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,101] FATAL [Kafka Server 1], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
Jul 08 12:03:29 broker01 kafka[23949]: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.loadRecord(FileLogInputStream.java:135)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.record(FileLogInputStream.java:149)
Jul 08 12:03:29 broker01 kafka[23949]: at kafka.log.LogSegment.$anonfun$recover$1(LogSegment.scala:22

The replication factor for most topics is either 2 or 3. So, I'm wondering if I can do the following:

  1. Change the replication factor to 2 for all the topics (Node 2 and Node 3 are running fine).
  2. Delete some data from Node1 (see the sketch after this list).
  3. Restart Node1.
  4. Change the replication factor back to 2 or 3, as it was initially.
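A minimal sketch of step 2, assuming the broker on Node1 is stopped first; the partition directory my-topic-0 is only an example, and you should only remove data for partitions that have in-sync replicas on Node 2 or Node 3:

    # find the largest partition directories on the full disk
    du -sh /data/disk2/kafka-logs/* | sort -rh | head
    # remove data for a partition that is fully replicated elsewhere (example path)
    rm -rf /data/disk2/kafka-logs/my-topic-0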

Does anyone know of a better way, or have a better suggestion?

Update: Steps 1 and 4 are not needed. Steps 2 and 3 alone are enough if you have replicas.

Upvotes: 1

Views: 6433

Answers (1)

Andrei

Reputation: 520

Your problem (and accordingly its solution) is similar to the one described in this question: kafka 0.9.0.1 fails to start with fatal exception

The easiest and fastest way is to delete part of the data. When the broker is started again, the data will be re-replicated according to the new retention.
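Once the broker is back up, you can check that its partitions have rejoined the ISR; a minimal sketch, where the topic name and ZooKeeper address are placeholders:

    # verify that all replicas of the topic are in sync after the restart
    kafka-topics.sh --zookeeper zk01:2181 --describe --topic my-topic
    # list any partitions that are still under-replicated across the cluster
    kafka-topics.sh --zookeeper zk01:2181 --describe --under-replicated-partitions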

So, I'm wondering if I can do the following...

Answering your question specifically: yes, you can perform the steps you described in sequence, and this will return the cluster to a consistent state.

To prevent this from happening in the future, you can try using the log.retention.bytes parameter instead of log.retention.hours. That said, I believe a size-based retention policy is not the best choice, because in my experience you usually need to know for at least how long a topic's data will be kept in the cluster.
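For illustration, both options go in the broker's server.properties; the values below are placeholders, not recommendations:

    # time-based retention: keep data for 7 days
    log.retention.hours=168
    # size-based retention: cap each partition at roughly 10 GB
    log.retention.bytes=10737418240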

Upvotes: 1
