Tribhuwan Negi
Tribhuwan Negi

Reputation: 51

Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are creating a real-time stream processing system with spark streaming which uses large number (millions) of analytic models applied to RDDs in the many different type of incoming metric data streams(more then 100000). This streams are original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which spark cluster node will process which specific RDDs from different streams, we need to make ALL these models available at each Spark compute node. This will create huge overhead at each spark node. We are considering using in-memory data grids to provide these models at spark compute nodes. Is this the right approach?

Or

Should we avoid using Spark streaming all together and just use in-memory data grids like Redis(with pub/sub) to solve this problem. In that case we will stream data to specific Redis nodes which contain the specific models. of course we will have to do all binning/window etc..

Please suggest.

Upvotes: 5

Views: 3125

Answers (1)

Tathagata Das
Tathagata Das

Reputation: 1808

Sounds like to me like you need a combination of stream processing engine and a distributed data store. I would design the system like this.

  1. The distributed datastore (Redis, Cassandra, etc.) can have the data you want to access from all the nodes.
  2. Receive the data streams through a combination data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process it in the stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
  3. In the functions that is used to process the stream records, the necessary data will have to pulled from the data store and maybe cached locally as appropriate.
  4. You may also have to update the data store from spark streaming as application needs it. In which case you will also have to worry about versioning of the data that you want pull in step 3.

Hopefully that made sense. Its hard to give any more specifics of the implementation without the exactly computation model. Hope this helps!

Upvotes: 6

Related Questions