krezno

Reputation: 480

How is Spark a memory-based solution if it writes data to disk before shuffles?

One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, which persists the intermediate results of its computation to HDFS, Spark keeps its results in memory. I don't understand this, because in reality, when a Spark stage finishes, it writes all of its data into shuffle files stored on disk. How is this an improvement on MapReduce, then?

Image taken from Advanced Apache Spark Training - Sameer Farooqui

Upvotes: 0

Views: 141

Answers (1)

krezno

Reputation: 480

I think I understand where Spark saves I/O.

In MapReduce, a chain of jobs looks like:

map -> reduce -> map -> reduce -> map -> reduce ...

which writes results to disk at every such arrow (the reduce output going all the way to HDFS).

In Spark, on the other hand, the reduce side of one shuffle and the map of the next stage are fused into a single stage:

map -> reduce + map -> reduce + map -> reduce ...

which cuts the intermediate disk I/O roughly in half: data hits the disk only in the shuffle files between stages, and those are written to local disk rather than HDFS.
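
A minimal sketch of that pipelining, assuming a local Spark setup (the object name, app name, and sample data are all made up for illustration):

import org.apache.spark.sql.SparkSession

object PipeliningDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("pipelining-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Stage 1: this map and the map side of reduceByKey are pipelined
    // in memory; shuffle files go to local disk only at the stage
    // boundary that reduceByKey introduces.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Stage 2: the reduce side of the first shuffle and this map run
    // in the same stage; nothing is written to HDFS in between.
    val grouped = counts.map { case (w, n) => (n, w) }.groupByKey()

    grouped.collect().foreach(println)
    spark.stop()
  }
}

If you run this and look at the Spark UI, you should see shuffle writes only at the two stage boundaries; the transformations inside each stage never materialize to disk.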

Upvotes: 0
