Reputation: 480
One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, which persists the intermediate results of its computation to HDFS, Spark keeps its results in memory. I don't understand this, because in reality, when a Spark stage finishes, it writes all of its data into shuffle files stored on disk. How, then, is this an improvement over MapReduce?
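To make the question concrete, here is a minimal PySpark sketch (the data and app name are made up) of a job with two shuffle boundaries; at the end of each stage the executors write shuffle files to local disk before the next stage reads them:

```python
# Minimal sketch: each wide transformation (reduceByKey, sortByKey)
# ends a stage, and that stage's output is written to local shuffle files.
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-boundaries")  # hypothetical app name

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

counts = (words
          .map(lambda w: (w, 1))             # narrow: stays in stage 1
          .reduceByKey(lambda x, y: x + y)   # wide: stage 1 writes shuffle files here
          .map(lambda kv: (kv[1], kv[0]))    # narrow: fused into stage 2
          .sortByKey())                      # wide: stage 2 writes shuffle files here

print(counts.collect())
sc.stop()
```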
Upvotes: 0
Views: 141
Reputation: 480
I think I understand where Spark saves IO.
In MapReduce, a multi-step computation is a chain of separate jobs:
map -> reduce -> map -> reduce -> map -> reduce ...
and at each such "arrow" the reduce output is written to HDFS and then re-read by the next job's map.
In Spark, by contrast, the reduce of one step is fused with the map of the next into a single stage:
map -> reduce + map -> reduce + map -> reduce ...
so the only disk IO at each boundary is the local shuffle write, which saves roughly half the IO (and shuffle files are not replicated the way HDFS output is).
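As a concrete illustration, here is a PySpark sketch (the data is made up) of two chained aggregations that would be two separate MapReduce jobs, with an HDFS write and re-read between them, but that run as a single Spark job in which the first aggregation's reduce side and the second one's map side share a stage:

```python
# Sketch: two chained aggregations in one Spark job. In MapReduce this
# would be two jobs, with job 1's reduce output written to HDFS and
# re-read by job 2's map; in Spark only local shuffle files sit between
# the stages.
from pyspark import SparkContext

sc = SparkContext(appName="fused-stages")  # hypothetical app name

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1)])

result = (pairs
          .reduceByKey(lambda x, y: x + y)   # "reduce" of step 1
          .map(lambda kv: (kv[1], 1))        # "map" of step 2: same stage as the reduce above
          .reduceByKey(lambda x, y: x + y))  # "reduce" of step 2

print(result.collect())  # how many keys occur once, twice, ...
sc.stop()
```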
Upvotes: 0