Reputation: 83
I am facing a unique problem, and wanted your opinions here.
I have a legacy MapReduce application in which multiple MapReduce jobs run sequentially, with the intermediate data written back and forth to HDFS. Because of this intermediate HDFS traffic, the jobs that handle small datasets lose more than they gain from HDFS's features, and take considerably longer than a non-Hadoop equivalent would. Eventually I plan to convert all my MapReduce jobs into Spark DAGs, but that is a big-bang change, so for now I am reasonably putting it off.
What I really want as a short-term solution is to change the storage layer, so that I keep the benefit of MapReduce parallelism but pay less of a penalty for storage. Along those lines, I am thinking of using Spark as the storage layer: the MapReduce jobs would store their outputs in Spark through a SparkContext, and the inputs would be read back from the SparkContext (by creating Spark input splits, each split backed by its own Spark RDD).
This way, intermediate data reads and writes would happen at memory speed, which should theoretically give me a significant performance improvement.
My question is: does this architectural scheme make sense? Has anyone encountered situations like this? Am I missing something significant that I should be considering even at this preliminary stage of the solution?
Thanks in advance!
Upvotes: 0
Views: 274
Reputation:
does this architectural scheme make sense?
It doesn't. Spark has no standalone storage layer, so there is nothing you can use here. On top of that, at its core Spark uses the standard Hadoop input and output formats for reading and writing data.
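To illustrate the point, here is a minimal sketch (paths and cluster addresses are made up) showing that even loading data "through Spark" goes via a standard Hadoop InputFormat and the Hadoop FileSystem abstraction, so the underlying storage layer does not go away:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopFormatDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("hadoop-format-demo"));

        // Spark's file loading is built on Hadoop InputFormats and the
        // Hadoop FileSystem API, so HDFS (or whatever backs the path)
        // is still doing the actual storage work. Path is hypothetical.
        JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
                "hdfs://namenode:8020/data/input",
                TextInputFormat.class, LongWritable.class, Text.class,
                new Configuration());

        System.out.println("records: " + lines.count());
        sc.stop();
    }
}
```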
If you want to reduce the overhead of the storage layer, you should rather consider accelerated storage (like Alluxio) or an in-memory data grid (like Ignite Hadoop Accelerator).
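For example, because Alluxio exposes a Hadoop-compatible FileSystem, an existing MapReduce stage can keep its intermediate output in memory just by pointing the output path at an alluxio:// URI. A minimal sketch, assuming a hypothetical driver, mapper, and reducer, with made-up hostnames and paths (normally the Alluxio client jar is on the classpath and fs.alluxio.impl is set once in core-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StageOneDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the alluxio:// scheme (assumption: usually configured
        // cluster-wide in core-site.xml rather than per job).
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

        Job job = Job.getInstance(conf, "stage-1");
        job.setJarByClass(StageOneDriver.class);
        job.setMapperClass(StageOneMapper.class);     // hypothetical mapper
        job.setReducerClass(StageOneReducer.class);   // hypothetical reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input still comes from HDFS; intermediate output goes to Alluxio,
        // which keeps hot data in memory while staying HDFS-API compatible,
        // so the next job in the chain can read it back unchanged.
        FileInputFormat.addInputPath(job,
                new Path("hdfs://namenode:8020/data/input"));
        FileOutputFormat.setOutputPath(job,
                new Path("alluxio://alluxio-master:19998/tmp/stage1"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The next job in the pipeline would then take alluxio://alluxio-master:19998/tmp/stage1 as its input path, leaving the mapper and reducer code untouched.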
Upvotes: 1