Sathish
Sathish

Reputation: 245

Kafka Interactive Queries - Accessing large data across instances

We are planning to run kafka streams application distributed in two machines. Each instance stores its Ktable data on its own machine. The challenge we face here is,

  1. We have a million records pushed to Ktable. We need to iterate the whole Ktable (RocksDB) data and generate the report.
  2. Let's say 500K records stored in each instance. It's not possible to get all records from other instance in a single GET over http (unless there is any streaming TCP technique available) . Basically we need two instance data in a single call and generate the report.

Proposed Solution: We are thinking to have a shared location (state.dir) for these two instances.So that these two instances will store the Ktable data at same directory and the idea is to get all data from a single instance without interactive query by just calling,

final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstance =
        streams.store("result",
            QueryableStoreTypes.<Key, Result>keyValueStore())

    KeyValueIterator<Key, ReconResult> iterator = allDataFromTwoInstance.all();
    while (iterator.hasNext()) {
       //append to excel report
    }

Question: Will the above solution work without any issues? If not, is there any alternative solution for this?

Please suggest. Thanks in Advance

Upvotes: 1

Views: 1249

Answers (2)

Matthias J. Sax
Matthias J. Sax

Reputation: 62285

This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware about the other data.

I think you should use GlobalKTable to get a full local copy of the data.

Upvotes: 3

Michal Borowiecki
Michal Borowiecki

Reputation: 4314

GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.

The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or when the dataset cannot fit a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:

http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery

and for RPC use a framework that supports streaming, e.g. akka-http.

Server-side streaming:

http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html

Consuming a streaming response:

http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities

Upvotes: 3

Related Questions