CuriousMind
CuriousMind

Reputation: 8943

Purpose of statestore and changelog topic in kafka streams?

I have a kafka stream application in which it is using stateStore (backed by RocksDB).

All what stream thread is doing is getting data from kafka topic and putting the data into state-store. (There is other thread which read data from statestore and does business logic processing).

I observed it creates a new kafka topic "changelog" because of stateStore.

But I didn't get what purpose "changelog" kafka topic serves?

Upvotes: 7

Views: 12080

Answers (2)

tharindu_DG
tharindu_DG

Reputation: 9291

When you enable change logging for a state store, Kafka Streams captures changes to the state and writes them to a changelog topic in Kafka. This changelog topic acts as a durable and fault-tolerant storage for the state, allowing the state to be restored in case of application restarts or failures.

Lets take word count example.

Initial State:

  • Word: "hello", Count: 1
  • Word: "world", Count: 1

Change Log Entries:

When a word is processed multiple times, the state store updates the count for that word, and these updates are written to the changelog topic.

  • Update for "hello":
    • Word: "hello", Count: 2
  • Update for "world":
    • Word: "world", Count: 2
  • Update for "hello" again:
    • Word: "hello", Count: 3

Changelog Topic:

The changelog topic for the word-count-store might contain records like the following

  • Key: "hello", Value: 1 (Initial state)
  • Key: "world", Value: 1 (Initial state)
  • Key: "hello", Value: 2 (Update)
  • Key: "world", Value: 2 (Update)
  • Key: "hello", Value: 3 (Update)

Restore State:

If the Kafka Streams application restarts or fails over to another instance, it can restore the state of the word-count-store by replaying the changelog topic from the beginning. This ensures that the state is consistent and up-to-date across application instances.

Compact Topics:

To optimize storage and reduce the volume of change log data, it can be configured to use log compaction. This ensures that only the latest update for each key is retained in the changelog topic, allowing the state to be fully restored while minimizing storage requirements.

Upvotes: 2

Lalit
Lalit

Reputation: 2014

Short answer to this question is to achieve fault tolerance.

Details:

changelog enables the State Store in your Kafka Streams application to be fault tolerant. As your application ingests more data into the state store, it gets pushed to the changelog topic, so that if the node that is running the application goes down, then the changelog topic is used to load the state store with the latest state.

Each application thread or instance gets it's own changelog topic partition so that every instance can recreate it's state after the application is restarted post failure.

The data is getting pushed to the topic automatically by Kafka Streams as and when there are updates made to the state store.

I would suggest going through the Chapter 11 of Kafka Definitive Guide - it contains a pretty good explanation of the Kafka Streams architecture and the stream processing patterns.

Hope this helps.

Upvotes: 12

Related Questions