Reputation: 51
We are building a real-time stream processing system with Spark Streaming that applies a large number (millions) of analytic models to RDDs from many different types of incoming metric data streams (more than 100,000). These streams are either original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which Spark cluster node will process which specific RDDs from the different streams, we would need to make ALL of these models available at every Spark compute node, which creates huge overhead at each node. We are considering using an in-memory data grid to provide the models to the Spark compute nodes. Is this the right approach?
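To make the overhead concrete, the naive approach amounts to shipping the full model set to every executor, e.g. via a broadcast variable. A rough sketch of what we want to avoid (all names and the trivial model are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder model type; real entries would be fitted analytic models.
trait Model extends Serializable { def score(v: Double): Double }
case class IdentityModel() extends Model { def score(v: Double): Double = v }

object BroadcastAllModels {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("naive-broadcast"))

    // In reality this map would hold millions of entries, one per model.
    val allModels: Map[String, Model] = Map("stream-1" -> IdentityModel())

    // broadcast() materializes one full copy of the map on every executor;
    // with millions of models this is exactly the per-node overhead above.
    val models = sc.broadcast(allModels)

    sc.parallelize(Seq(("stream-1", 0.97))).foreach { case (id, v) =>
      models.value(id).score(v)
    }
    sc.stop()
  }
}
```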
Or
Should we avoid using Spark Streaming altogether and just use an in-memory data grid like Redis (with pub/sub) to solve this problem? In that case we would stream the data to the specific Redis nodes that hold the specific models. Of course, we would then have to implement all the binning/windowing etc. ourselves.
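Roughly, the routing we have in mind would look something like this (a sketch only; the shard host name and channel scheme are made-up placeholders):

```scala
import redis.clients.jedis.Jedis

object RouteToModelShard {
  def main(args: Array[String]): Unit = {
    // The shard known to hold the models for this stream (placeholder name).
    val jedis = new Jedis("redis-shard-for-stream-42", 6379)

    // A subscriber co-located with the model for stream-42 does the scoring;
    // windowing/binning would have to be implemented by hand on that side.
    jedis.publish("metrics:stream-42", "stream-42,0.97")
    jedis.close()
  }
}
```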
Please suggest.
Upvotes: 5
Views: 3125
Reputation: 1808
It sounds to me like you need a combination of a stream processing engine and a distributed data store. I would design the system like this:
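A minimal sketch of such a combination, assuming Spark Streaming as the engine and Redis as the model store; the socket source, the `model:<streamId>` key scheme, and the `deserialize` stub are placeholders of mine, not a prescribed API:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

object ScoringJob {
  // Stand-in for however the serialized models are actually decoded.
  def deserialize(blob: String): Double => Double = v => v

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("metric-scoring"), Seconds(1))

    // (streamId, value) pairs, however they actually arrive.
    val metrics = ssc.socketTextStream("ingest-host", 9999).map { line =>
      val Array(id, v) = line.split(",")
      (id, v.toDouble)
    }

    metrics.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One Redis connection per partition, not per record.
        val jedis = new Jedis("redis-host", 6379)
        records.foreach { case (id, value) =>
          // Pull only the model this record needs, keyed by stream id.
          val model = deserialize(jedis.get(s"model:$id"))
          model(value)
        }
        jedis.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because each executor fetches only the models for the records it actually sees (and opens one connection per partition rather than per record), no node ever has to hold the full model set.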
Hopefully that made sense. It's hard to give any more implementation specifics without knowing the exact computation model. Hope this helps!
Upvotes: 6