Kafka Interactive Queries - Accessing large data across instances

Question

We are planning to run kafka streams application distributed in two machines. Each instance stores its Ktable data on its own machine. The challenge we face here is,

We have a million records pushed to Ktable. We need to iterate the whole Ktable (RocksDB) data and generate the report.
Let's say 500K records stored in each instance. It's not possible to get all records from other instance in a single GET over http (unless there is any streaming TCP technique available) . Basically we need two instance data in a single call and generate the report.

Proposed Solution: We are thinking to have a shared location (state.dir) for these two instances.So that these two instances will store the Ktable data at same directory and the idea is to get all data from a single instance without interactive query by just calling,

final ReadOnlyKeyValueStore allDataFromTwoInstance =
        streams.store("result",
            QueryableStoreTypes.keyValueStore())

    KeyValueIterator iterator = allDataFromTwoInstance.all();
    while (iterator.hasNext()) {
       //append to excel report
    }

Question: Will the above solution work without any issues? If not, is there any alternative solution for this?

Please suggest. Thanks in Advance

Michal Borowiecki · Accepted Answer

GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.

The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or when the dataset cannot fit a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:

http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery

and for RPC use a framework that supports streaming, e.g. akka-http.

Server-side streaming:

http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html

Consuming a streaming response:

http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities

Kafka Interactive Queries - Accessing large data across instances

Answers (2)

Related Questions