jessica

Reputation: 2590

Data still remains in Kafka topic even after retention time/size

We set the log retention to 1 hour (the previous setting was 72 hours).

Using the Kafka command line tools, we set retention.ms to 1 hour. Our aim is to purge the data older than 1 hour from the topic topic_test, so we used the following command:

kafka-configs.sh --alter \
  --zookeeper localhost:2181  \
  --entity-type topics \
  --entity-name topic_test \
  --add-config retention.ms=3600000

and also

kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic topic_test \
  --config retention.ms=3600000

Both commands ran without errors.
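
One way to confirm the override actually landed (a sketch, reusing the same ZooKeeper connection as above) is to describe the per-topic configs:

kafka-configs.sh --describe \
  --zookeeper localhost:2181 \
  --entity-type topics \
  --entity-name topic_test

This should list retention.ms=3600000 among the topic's overrides.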

But the problem is that the Kafka data older than 1 hour still remains!

Actually, no data was removed from the topic_test partitions. We have an HDP Kafka cluster (version 1.0.x) managed by Ambari.

We do not understand why the data on topic_test still remains and has not decreased, even after running both CLI commands as described above.

What is wrong with the following Kafka CLI commands?

kafka-configs.sh --alter --zookeeper localhost:2181  --entity-type topics  --entity-name topic_test --add-config retention.ms=3600000

kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_test --config retention.ms=3600000

From the Kafka server.log we can see the following:

[2020-07-28 14:47:27,394] INFO Processing override for entityPath: topics/topic_test with config: Map(retention.bytes -> 2165441552, retention.ms -> 3600000) (kafka.server.DynamicConfigManager)
[2020-07-28 14:47:27,397] WARN retention.ms for topic topic_test is set to 3600000. It is smaller than message.timestamp.difference.max.ms's value 9223372036854775807. This may result in frequent log rolling. (kafka.server.TopicConfigHandler)

reference - https://ronnieroller.com/kafka/cheat-sheet

Upvotes: 6

Views: 7267

Answers (1)

Michael Heil

Reputation: 18475

The log cleaner only works on inactive (sometimes also referred to as "old" or "clean") segments. As long as all data fits into the active ("dirty", "unclean") segment, whose size is defined by the segment.bytes limit, no cleaning will happen.

The configuration cleanup.policy is described as:

A string that is either "delete" or "compact" or both. This string designates the retention policy to use on old log segments. The default policy ("delete") will discard old segments when their retention time or size limit has been reached. The "compact" setting will enable log compaction on the topic.

In addition, the segment.bytes is:

This configuration controls the segment file size for the log. Retention and cleaning is always done a file at a time so a larger segment size means fewer files but less granular control over retention.
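
To make this concrete, each partition directory on disk holds one file per segment, and retention only ever deletes whole closed files (a sketch; the actual path depends on log.dirs in the broker configuration, and the offsets are made up):

ls -lh /var/kafka-logs/topic_test-0/
# 00000000000000000000.log   <- oldest closed segment, eligible for deletion
# 00000000000000523407.log   <- closed segment
# 00000000000001047219.log   <- active segment, never deleted while still being written
# (each .log file is accompanied by .index / .timeindex files)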

The configuration segment.ms can also be used to steer the deletion:

This configuration controls the period of time after which Kafka will force the log to roll even if the segment file isn't full to ensure that retention can delete or compact old data.

As it defaults to one week, you might want to reduce it to fit your needs.

Therefore, if you want to set the retention of a topic to e.g. one hour you could set:

cleanup.policy=delete
retention.ms=3600000
segment.ms=3600000
file.delete.delay.ms=1 (The time to wait before deleting a file from the filesystem)
segment.bytes=1024
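
As a sketch (assuming the same ZooKeeper connection string as in the question), all of these overrides can be applied in a single call:

kafka-configs.sh --alter \
  --zookeeper localhost:2181 \
  --entity-type topics \
  --entity-name topic_test \
  --add-config cleanup.policy=delete,retention.ms=3600000,segment.ms=3600000,file.delete.delay.ms=1,segment.bytes=1024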

Note: I am not referring to retention.bytes; segment.bytes is a very different thing, as described above. Also, be aware that log.retention.hours is a cluster-wide configuration, so if you plan to have different retention times for different topics, setting retention.ms per topic will solve it.
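
Also note that segment.bytes=1024 is really only useful for forcing an immediate purge; once the old data is gone you would normally remove the temporary overrides again so the topic does not keep producing tiny segment files (a sketch, same connection assumptions as above):

kafka-configs.sh --alter \
  --zookeeper localhost:2181 \
  --entity-type topics \
  --entity-name topic_test \
  --delete-config segment.bytes,segment.ms,file.delete.delay.ms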

Upvotes: 16
