Alexander

Reputation: 621

Faster building of Kafka Streams state

I have default 7 days of latest streaming data stored in Kafka:

log.retention.hours=168

When deploying a new version of the Streams application, it takes a significant amount of time to reprocess the old data before the application is actually usable.

Are there any options to make it quicker other than reducing the retention period?

What comes to mind is that state stores shouldn't be persisted to disk until all the data has been processed.

Upvotes: 0

Views: 823

Answers (2)

Alexander

Reputation: 621

What I finally came up with is processing only the last N hours of the original data in my Streams application, using a filter:

myStream.filter { (_, value) =>
  // Keep only records newer than `streamHours` hours before now
  // (requires java.util.concurrent.TimeUnit)
  val cutoffMs = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(streamHours)
  value.timestamp > cutoffMs
}

Upvotes: 0

Emil

Reputation: 107

I'm guessing you have state stores backed by changelog topics in your app, and that the time is spent restoring the app's state?

  1. Even if the input topic has retention configured, changelog topics have cleanup.policy set to compact by default, so they are retained indefinitely.
  2. What is the size of the keyset? A compacted changelog topic holds one entry per key you store, so reducing the number of distinct keys gives you a smaller state to restore.
  3. Consider changing segment.ms and min.cleanable.dirty.ratio on the changelog topics to make compaction more aggressive.
  4. Consider tuning the RocksDB config.
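For point 4, Kafka Streams exposes a `RocksDBConfigSetter` hook. A minimal sketch, assuming larger write buffers reduce flushes during restoration (the class name and sizes are made up for illustration):

    import java.util.{Map => JMap}
    import org.apache.kafka.streams.state.RocksDBConfigSetter
    import org.rocksdb.Options

    // Sketch: enlarge RocksDB's memtables so bulk restoration
    // triggers fewer flushes. Sizes are illustrative only.
    class BootstrapRocksDBConfig extends RocksDBConfigSetter {
      override def setConfig(storeName: String, options: Options,
                             configs: JMap[String, AnyRef]): Unit = {
        options.setWriteBufferSize(64 * 1024 * 1024) // 64 MB memtable
        options.setMaxWriteBufferNumber(4)
      }
    }

The class is registered through the Streams config, e.g. `props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, classOf[BootstrapRocksDBConfig])`.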
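For point 3, the topic settings can be changed with the stock kafka-configs tool. The topic name below is hypothetical; by default Kafka Streams names changelog topics `<application.id>-<store name>-changelog`, and the values shown are illustrative, not recommendations:

```shell
# Shrink segments and lower the dirty ratio so the log cleaner
# compacts the changelog topic more eagerly (example values only)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-app-my-store-changelog \
  --alter --add-config segment.ms=600000,min.cleanable.dirty.ratio=0.1
```

Note that only closed segments are eligible for compaction, which is why a smaller segment.ms helps here.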

Upvotes: 2
