krezno

Reputation: 480

How is Spark a memory-based solution if it writes data to disk before shuffles?

One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, which persists the intermediate results of its computation to HDFS, Spark keeps its results in memory. I don't understand this, because in reality, when a Spark stage finishes, it writes all of its data into shuffle files stored on disk. How is this an improvement on MapReduce, then?

Image taken from Advanced Apache Spark Training - Sameer Farooqui

Upvotes: 0

Views: 141

Answers (1)

krezno

Reputation: 480

I think I understand where Spark saves I/O.

In MapReduce, a chain of jobs looks like:

map -> reduce -> map -> reduce -> map -> reduce ...

which writes results to disk at every such arrow (the reduce output going all the way to HDFS).

In Spark, on the other hand, the reduce side of one shuffle and the map of the next stage are fused into a single stage:

map -> reduce + map -> reduce + map -> reduce ...

which cuts the intermediate disk I/O roughly in half: data hits the disk only in the shuffle files between stages, and those are written to local disk rather than HDFS.
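
A minimal sketch of that pipelining, assuming a local Spark setup (the object name, app name, and sample data are all made up for illustration):

import org.apache.spark.sql.SparkSession

object PipeliningDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("pipelining-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Stage 1: this map and the map side of reduceByKey are pipelined
    // in memory; shuffle files go to local disk only at the stage
    // boundary that reduceByKey introduces.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Stage 2: the reduce side of the first shuffle and this map run
    // in the same stage; nothing is written to HDFS in between.
    val grouped = counts.map { case (w, n) => (n, w) }.groupByKey()

    grouped.collect().foreach(println)
    spark.stop()
  }
}

If you run this and look at the Spark UI, you should see shuffle writes only at the two stage boundaries; the transformations inside each stage never materialize to disk.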

Upvotes: 0
