Reputation: 10540
Everywhere on google the key difference between Spark and Hadoop MapReduce is stated in the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to a disk. Looks like I got it, but I would like to confirm it with an example.
Consider this word count example:
val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
My understanding:
In the case of Spark, once the lines are split by " ", the output is stored in memory. The same applies to the map and reduceByKey steps. I believe the same is true when processing happens across partitions.
In the case of MapReduce, will each intermediate result (like the words after split/map/reduce) be kept on disk, i.e. HDFS, which makes it slower compared to Spark? Is there no way to keep them in memory? Is the same true for per-partition results?
Upvotes: 0
Views: 362
Reputation: 18108
Yes, you are broadly right.
Spark keeps intermediate RDD (Resilient Distributed Dataset) results in memory where possible, so latency is much lower and job throughput is higher. One nuance: MapReduce writes intermediate map output to the local disk of each node, not to HDFS; only a job's final output goes to HDFS. But when several MR jobs are chained, results between jobs do pass through HDFS, which is a big part of the slowdown. RDDs have partitions, chunks of data like MR input splits. Spark also offers efficient iterative processing, another key point: intermediate results can be cached in memory and reused across iterations, whereas each MR job must re-read its input from disk.
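For example, here is a small sketch (reusing the question's file; `cache()` is the standard RDD API for asking Spark to keep an RDD in memory after it is first computed):

```scala
// Sketch: cache an RDD so later actions reuse the in-memory copy
// instead of re-reading and re-splitting the file each time.
val text  = sc.textFile("mytextfile.txt")
val words = text.flatMap(_.split(" ")).cache()

val total    = words.count()            // first action computes and caches the RDD
val distinct = words.distinct().count() // reuses cached partitions, no re-read from disk
```

In plain MapReduce, each of those two counts would be a separate job re-reading the input from HDFS.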
MR does have a Combiner of course (local pre-aggregation of map output before the shuffle) to ease the pain a little.
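Note that Spark's `reduceByKey` gives you the equivalent of a Combiner for free: it merges values locally within each partition before the shuffle, unlike `groupByKey`, which ships every raw pair across the network. A sketch against the question's example:

```scala
// reduceByKey combines (word, 1) pairs locally within each partition
// before the shuffle -- analogous to an MR Combiner -- so far less
// data crosses the network than with groupByKey.
val counts = text.flatMap(_.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)   // map-side combine happens automatically
```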
But Spark is also far easier to use, whether with Scala or PySpark.
I would not worry about MR anymore - in general.
Here is an excellent read on SPARK BTW: https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454
Upvotes: 0