Reputation: 1997
How do I recover from a corrupt file in Kafka?
We are running a three-node cluster with a replication factor of 2 and ISR=1. Recently we had a near-simultaneous failure of all brokers at the same time. This resulted in a situation where broker 102 is down while the other two brokers recovered. Unfortunately, at least one partition of one topic had 102 as the leader, with an ISR containing only 102 as well. This means the other brokers are missing some (unknown) amount of data from this partition, and they therefore refuse to receive/send data for this topic.
Since I would like to recover my cluster and my data, I am trying to restart broker 102. But it fails on some unknown file with this message:
[2018-07-18 14:44:44,806] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.KafkaException: java.io.EOFException: Failed to read `log header` from file channel `sun.nio.ch.FileChannelImpl@375a9d12`. Expected to read 17 bytes, but reached end of file after reading 0 bytes. Started read from position 2147483631. (kafka.log.LogManager)
[2018-07-18 14:44:44,809] ERROR [KafkaServer id=102] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.KafkaException: java.io.EOFException: Failed to read `log header` from file channel `sun.nio.ch.FileChannelImpl@375a9d12`. Expected to read 17 bytes, but reached end of file after reading 0 bytes. Started read from position 2147483631.
Unfortunately this does not tell me which file is broken. I have repeatedly tried restarting broker 102 in the hope that the re-indexing it does on startup would somehow recover the files, but no luck.
My guess is that the offending file is not from the partition for which 102 is the dead leader. So I am wondering:
a) Can I delete all the log files on 102 for the partitions where 102 is not the leader, so that when it comes back online it simply resyncs without problems?
b) Can I make 102 restart if I somehow locate the right file and remove it?
c) Is there any way of figuring out which file Kafka chokes on?
Upvotes: 0
Views: 2139
Reputation: 2229
Having RF=2 and ISR=1 on a 3-node cluster can leave you in an inconsistent state when the ISR shrinks to a single node: if the leader changes during that window, two different nodes may accept writes as leader, so you can end up with two divergent versions of history.
To guarantee consistency in the future, prefer RF=3 and min.insync.replicas=2, together with producer acks=all.
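As a minimal sketch of that setup (the topic name my-topic, partition count, and ZooKeeper/broker addresses are placeholders; newer Kafka versions take --bootstrap-server instead of --zookeeper for kafka-topics.sh):
# Create the topic with 3 replicas and require 2 of them to be in sync for a write
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic \
    --partitions 6 --replication-factor 3 --config min.insync.replicas=2
# On the producer side, require acknowledgement from all in-sync replicas
# (producer.properties, or the acks setting in your client code)
acks=all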
You can try to use the DumpLogSegments utility to check broker 102's log files for validity and dump data out of them:
bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 000000000000000xxx.log
Parse a log file and dump its contents to the console, useful for debugging a seemingly corrupt log segment.
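To answer your question (c), one way to find the file Kafka chokes on is to run DumpLogSegments over every segment and watch which one throws. A rough sketch, assuming broker 102's log.dirs points at /var/lib/kafka/data (adjust the path to your installation); the tool may not set a non-zero exit code for every kind of corruption, so also watch stderr for the EOFException:
for f in /var/lib/kafka/data/*/*.log; do
    echo "Checking $f"
    # Corrupt segments typically surface as an exception instead of normal record output
    bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --files "$f" \
        > /dev/null || echo "Problem reading $f"
done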
You would then need to check against the current partition leader which of those messages are not present there, and republish the missing ones.
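As a hedged sketch of that comparison (topic name and broker addresses are placeholders): GetOffsetShell shows the latest offset the surviving leader has per partition, which you can compare with the last offsets DumpLogSegments printed from 102's segments.
# Latest offset per partition on the current (surviving) leaders
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list broker1:9092,broker3:9092 --topic my-topic --time -1
# Anything DumpLogSegments showed on broker 102 beyond those offsets is missing
# on the leader and would have to be re-published, e.g. with your own producer
# or kafka-console-producer.sh after extracting the payloads.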
Upvotes: 1