Reputation: 132
I have a business requirement to keep a 12-hour window of stream data and to query it. The volume is around 100M records per 12 hours, and I need to maintain the ordering of all events. I built a system for this with the Kafka Streams API, and volume doesn't seem to be an issue. The real issue is that the business wants to search through the events in the state stores, almost every state store. The search is not key-based; it is based on fields in the value.
I tried KSQL server with a data set of 25M records. Running simple queries over an 8-hour window took almost 240 seconds to complete. (Right now I am using a single node and a single partition.)
The other approach I'm considering is to hook Elasticsearch up to the streams and state stores and run the queries there, but I'm not sure whether storing the data of every state store in it is a good solution.
I would like the community's opinion on the best approach for querying a stream at this volume with a low response time requirement.
I'm still new to Kafka and looking forward to suggestions and guidance.
Upvotes: 1
Views: 1086
Reputation: 3433
Kafka itself isn't optimized for indexed queries, or even any queries that don't involve starting at an offset and reading forward in the log. The best way to query the data is to sink it out to systems that fit your query requirements.
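To make the access pattern concrete, here is a minimal sketch in plain Python. It is a toy in-memory stand-in for one partition, not a Kafka client; `AppendOnlyLog` and `scan_for` are hypothetical names invented for illustration. It shows that the only cheap operation is "seek to an offset and read forward", so a search on a value field has to touch every record.

```python
class AppendOnlyLog:
    """Toy stand-in for a single topic partition (illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # Records are only ever added at the end; the offset is the position.
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        # The one cheap access path: start at an offset and read forward.
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


def scan_for(log, field, wanted):
    # No index exists on value fields, so a field query is a full scan.
    return [off for off, value in log.read_from(0) if value.get(field) == wanted]


log = AppendOnlyLog()
for i in range(6):
    log.append({"user": f"u{i % 3}", "amount": i * 10})

matches = scan_for(log, "user", "u1")
```

With 100M records per window, every such query pays for the whole scan, which is consistent with the long KSQL query times described in the question.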
Kafka Streams does support interactive queries, but those are lookups by key. If, as you say, you need to search on fields in the value rather than on keys, you're likely better off writing to a system that supports secondary indexes.
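As a rough illustration of what a secondary index buys you, the sketch below keeps a 12-hour window of events and indexes them by one value field. It is plain Python, not Kafka Streams or Elasticsearch code; `WindowedFieldIndex` is a hypothetical name, and it assumes events arrive in timestamp order (as with a single partition) and that every value contains the indexed field.

```python
from collections import defaultdict, deque

WINDOW_MS = 12 * 60 * 60 * 1000  # 12-hour window, as in the question


class WindowedFieldIndex:
    """Toy secondary index on one value field with time-based eviction."""

    def __init__(self, field, window_ms=WINDOW_MS):
        self.field = field
        self.window_ms = window_ms
        self._events = deque()                # (timestamp_ms, value), arrival order
        self._by_field = defaultdict(deque)   # field value -> same entries

    def add(self, ts_ms, value):
        # Assumption: timestamps are non-decreasing.
        entry = (ts_ms, value)
        self._events.append(entry)
        self._by_field[value[self.field]].append(entry)
        self._evict(ts_ms)

    def _evict(self, now_ms):
        cutoff = now_ms - self.window_ms
        while self._events and self._events[0][0] < cutoff:
            _, old_value = self._events.popleft()
            key = old_value[self.field]
            # Both deques are in arrival order, so the globally oldest
            # event is also the oldest one in its own bucket.
            self._by_field[key].popleft()
            if not self._by_field[key]:
                del self._by_field[key]

    def find(self, field_value):
        # Direct lookup by field value; results stay in arrival order.
        return [value for _, value in self._by_field.get(field_value, ())]


HOUR = 60 * 60 * 1000
idx = WindowedFieldIndex("user")
idx.add(0 * HOUR, {"user": "a", "amount": 1})
idx.add(1 * HOUR, {"user": "b", "amount": 2})
idx.add(6 * HOUR, {"user": "a", "amount": 3})
idx.add(13 * HOUR, {"user": "b", "amount": 4})  # pushes the first event out of the window

found_a = idx.find("a")
found_b = idx.find("b")
```

A store such as Elasticsearch does this indexing for many fields at once and persists it, which is why sinking the stream into it tends to fit field-based search better than scanning state stores.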
Upvotes: 2