Anthony Vinay

Reputation: 647

Kafka topic retention and its impact on the state store in Kafka Streams

I have a state store (created via Materialized.as()) in a Kafka Streams word-count application.
Based on my understanding, the state store is backed by an internal Kafka topic.


My questions are:

  1. Can state stores hold unlimited key-value pairs, or are they governed by the rules of Kafka topics, i.e., the log.retention policies or log.segment.bytes?
  2. I set log.retention.ms=60000 and expected the state store value to be reset to 0 after a minute, but that is not happening; I can still see values in the state store. Does Kafka wipe out the logs completely, or does it keep a snapshot in the case of a log-compacted topic?
  3. What does "segment gets committed" mean?

Please include sources for your answers if available.

Upvotes: 1

Views: 2920

Answers (2)

miguno

Reputation: 15067

Can state stores hold unlimited key-value pairs, or are they governed by the rules of Kafka topics, i.e., the log.retention policies or log.segment.bytes?

Yes, state stores can hold an unlimited number of key-value pairs, i.e., events (or 'messages'). Well, local app storage space and remote storage space in Kafka permitting, of course (the latter is used to durably store the data in your state stores).

Your application's state stores are persisted remotely in compacted internal Kafka topics. Compaction means that Kafka periodically purges older events for the same event key (e.g., Bob's old account balances) from storage. But compacted topics do not remove the most recent event per event key (e.g., Bob's current account balance). There is no upper limit for how many such 'unique' key-value pairs will be stored in a compacted topic.
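The "most recent event per key wins" behavior of compaction can be sketched with a small Python simulation. This is illustrative only; the real log cleaner runs inside the broker and rewrites segment files rather than building an in-memory map:

```python
def compact(events):
    """Simulate log compaction: keep only the most recent value per key.

    `events` is an ordered list of (key, value) pairs, oldest first.
    Kafka's log cleaner reaches the same end state by rewriting old
    segment files and discarding any record that is superseded by a
    newer record with the same key.
    """
    latest = {}
    for key, value in events:
        latest[key] = value  # a newer event for `key` supersedes older ones
    return latest

# Bob's old account balances are purged; only his latest balance survives.
events = [("bob", 100), ("alice", 50), ("bob", 75), ("bob", 120)]
print(compact(events))  # {'bob': 120, 'alice': 50}
```

Note that "alice" keeps her single value untouched: compaction never removes the latest event per key, no matter how old it is.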

I set log.retention.ms=60000 and expected the state store value to be reset to 0 after a minute, but that is not happening; I can still see values in the state store.

log.retention.ms is not used when a topic is configured to be compacted (log.cleanup.policy=compact). See the existing SO question Log compaction to keep exactly one message per key for details, including why compaction doesn't happen immediately (in short, that's because compaction operates on partition segment files; it will not touch the most current segment file, and there can be multiple events per event key in that file).
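Why you can still see "old" values can also be sketched: the cleaner only rewrites closed segments and never touches the active (most recent) segment, so superseded values for a key can linger there. A toy Python model, assuming fixed-size segments purely for illustration:

```python
def split_into_segments(events, segment_size):
    """Group an ordered event list into fixed-size 'segment files'."""
    return [events[i:i + segment_size]
            for i in range(0, len(events), segment_size)]

def clean(segments):
    """Compact all closed segments; the active (last) segment is untouched,
    mimicking the fact that Kafka's log cleaner skips the current segment."""
    if not segments:
        return []
    closed, active = segments[:-1], segments[-1]
    latest = {}
    for segment in closed:
        for key, value in segment:
            latest[key] = value  # newest value per key wins within closed segments
    return [list(latest.items()), active] if latest else [active]

events = [("bob", 100), ("bob", 75), ("alice", 50), ("bob", 120)]
segments = split_into_segments(events, segment_size=2)
# The closed segment compacts bob -> 75, but the active segment still
# holds ("alice", 50) and ("bob", 120), so "bob" appears twice overall.
print(clean(segments))
```

In the real broker, segment boundaries are driven by log.segment.bytes and segment.ms rather than an event count, but the consequence is the same: until a segment rolls and the cleaner runs, multiple events per key remain readable.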

Note: You can nowadays set the configuration log.cleanup.policy to a combination of compaction and time/volume-based retention with log.cleanup.policy=compact,delete (see KIP-71 for details). But generally you shouldn't fiddle with this setting unless you really know what you are doing -- the defaults are what you need 99% of the time.
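As a hedged illustration of such a combined policy (the values below are made up; cleanup.policy and retention.ms are real topic-level configuration names), the topic-level overrides would look like:

```properties
# Topic-level configuration overrides (values illustrative):
# purge superseded values per key AND delete data older than retention.ms
cleanup.policy=compact,delete
retention.ms=604800000
```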

Does Kafka wipe out the logs completely, or does it keep a snapshot in the case of a log-compacted topic? What does "segment gets committed" mean?

I don't understand this question, unfortunately. :-) Perhaps my previous answers and reference links already cover your needs. What I can say is that, no, Kafka doesn't wipe out the logs completely. Compaction operates on a topic-partition's segment files. You will probably need to read up on how compaction works; I would suggest an article like https://medium.com/@sunny_81705/kafka-log-retention-and-cleanup-policies-c8d9cb7e09f8, in case the Apache Kafka docs weren't sufficiently clear.

Upvotes: 6

OneCricketeer

Reputation: 191743

State stores are maintained by compacted, internal topics. They therefore follow the same semantics as compacted topics, and do not have a finite retention duration.

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management

Upvotes: 1
