Reputation: 1167
From the paper "GraphX: Graph Processing in a Distributed Dataflow Framework" (Gonzalez et al. 2014) I learned that GraphX modified Spark's shuffle:
Memory-based Shuffle: Spark’s default shuffle implementation materializes the temporary data to disk. We modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout.
(The paper does not explain anything more on this point.)
It seems that this change aims at optimizing shuffles in the context of highly iterative graph processing algorithms.
How does this "memory-based shuffle" work exactly, how does it differ from Spark Core's one, and what are the pros and cons: why is it well suited for GraphX use cases and not for other Spark jobs?
I failed to understand the big picture directly from the GraphX/Spark sources, and I also struggled to find the information out there.
Apart from an ideal answer, comments with links to sources are welcome too.
Upvotes: 0
Views: 186
Reputation: 26
I failed to understand the big picture directly from GraphX/Spark sources
Because it was never included in the mainstream distribution.
Back when the first GraphX version was developed, Spark used a hash-based shuffle, which was rather inefficient. It was one of the main bottlenecks in Spark jobs, and there was significant research into developing alternative shuffle strategies.
Since GraphX algorithms are iterative and join-based, improving shuffle speed was an obvious path.
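To illustrate why (this is just a sketch, not code from the GraphX paper): a typical GraphX computation repeats a "send messages over edges, aggregate them, join them back into the vertex RDD" step, so every iteration pays for at least one shuffle stage. With the old hash-based shuffle each of those stages materialized its map outputs to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object ShuffleHeavySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-shuffle-sketch").setMaster("local[*]"))

    // Tiny toy graph; edge attributes are weights.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(3L, 1L, 1.0)))
    val graph = Graph.fromEdges(edges, defaultValue = 0.0)

    // Single-source shortest paths with Pregel. Every superstep
    // sends messages along edges and joins the aggregated messages
    // back into the vertex RDD, so iterative jobs pay the shuffle
    // cost once per iteration.
    val sourceId: VertexId = 1L
    val init = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = Pregel(init, Double.PositiveInfinity)(
      (_, dist, newDist) => math.min(dist, newDist),          // vertex program
      triplet =>                                              // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b))                               // merge messages

    sssp.vertices.collect().foreach(println)
    sc.stop()
  }
}
```

Each Pregel superstep above adds shuffle stages to the job, which is why shaving cost off the shuffle path mattered so much for GraphX workloads.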
Since then, a pluggable shuffle manager has been introduced, as well as a new sort-based shuffle, which eventually turned out to be fast enough to make both the hash-based shuffle and the ongoing work on a generic memory-based shuffle obsolete.
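For reference, a minimal sketch of how the shuffle implementation became selectable in Spark 1.x (the app name below is just a placeholder; since Spark 2.x only the sort-based manager remains and this key is gone):

```scala
import org.apache.spark.SparkConf

// Spark 1.x exposed the shuffle strategy through a pluggable ShuffleManager;
// the implementation could be switched with the spark.shuffle.manager key.
val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")      // placeholder app name
  .set("spark.shuffle.manager", "sort")    // "hash" (the original default) was the other value
```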
Upvotes: 1