user2250246
user2250246

Reputation: 3967

Can Kafka compaction overwrite messages with same partition key?

I am using following code to write to Kafka:

String partitionKey = "" + System.currentTimeMillis();
KeyedMessage<String, String> data = new KeyedMessage<String, String>(topic, partitionKey, payload);

And we are using 0.8.1.1 version of Kafka.

Is it possible that when multiple threads are writing, some of them (with different payload) write with same partition key and because of that Kafka overwrites these messages (due to same partitionKey)?

The documentation that got us thinking in this direction is: http://kafka.apache.org/documentation.html#compaction

Upvotes: 2

Views: 7453

Answers (2)

user2250246
user2250246

Reputation: 3967

I found some more material at https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction

Salient points:

  1. Before 0.8 version, Kafka supported only a single retention mechanism: deleting old segments of log
  2. Log compaction provides an alternative such that it maintains the most recent entry for each unique key, rather than maintaining only recent log entries.
  3. There is a per-topic option to choose either "delete" or "compact".
  4. Compaction guarantees that each key is unique in the tail of the log. It works by recopying the log from beginning to end, removing keys which have a later occurrence in the log.
  5. Any consumer that stays within the head of the log (~1GB) will see all messages.

So whether we have log compaction or not, it follows that Kafka deletes older records but the records in the head of the log are safe from that.

Missing records problem will occur only when downstream clients are unable to empty Kafka queues for a very long time (such that per topic size/time limit is hit).

This should be an expected behavior I think since we cannot keep records forever. They have to be deleted some time or the other.

Upvotes: 6

Gwen Shapira
Gwen Shapira

Reputation: 5168

Sounds very possible. Compaction saves the last message for each key. If you have multiple messages sharing a key, only the last one will be saved after compaction. The normal use-case is database replication where only the latest state is interesting.

Upvotes: 3

Related Questions