Reputation: 314
Background
I am a newbie in Spark and want to understand shuffling. I have the following two questions about shuffling in Apache Spark.
1) Why is there a change in the number of partitions before performing a shuffle? Spark does it by default, changing the partition count to the value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that the data is also saved on disk. Is my understanding correct?
Upvotes: 0
Views: 433
Reputation: 18108
Two questions actually.
Nowhere is it stated that you need to change this parameter; 200 is the default if not set. It applies to JOINing and AGGregating. You may have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity - if more Executors are available. In general, if your data volume is huge, more parallelism, where possible, will speed up processing time.
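To make that concrete, here is a minimal sketch showing the two places the parameter can be set - the app name, master and partition counts are illustrative only, not a recommendation:

```scala
// Minimal sketch: two places to set spark.sql.shuffle.partitions.
// The app name, master and partition counts below are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[*]")                              // assumption: a local run
  .config("spark.sql.shuffle.partitions", "400")   // override the default of 200 at build time
  .getOrCreate()

// ...or change it at runtime; subsequent joins/aggregations shuffle into 800 partitions.
spark.conf.set("spark.sql.shuffle.partitions", "800")
```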
Assuming an Action has been called - so as to avoid the obvious comment if this is not stated - and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. We look at an RDD initially:
A DAG dependency involving a shuffle means the creation of a separate Stage.
CURRENT STAGE
- All the (fused) Map operations are performed intra-Stage.
- The next Stage requires a Reduce operation - e.g. a reduceByKey - which means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage (see the sketch after this list).
- This grouped data is written to disk on the Worker where the Executor runs - or to the storage tied to that cloud variant. (In-memory would seem possible if the data is small, but writing to disk is the architectural approach Spark takes, as stated in the docs.)
- The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. The ShuffleManager keeps track of all keys/locations once all of the map-side work is done.
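A small, self-contained sketch of the Map-side fusing and the shuffle boundary described above - the data and keys are invented, only the shape of the pipeline matters:

```scala
// Sketch of narrow (fused) Map operations followed by a shuffle boundary.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// map and filter are narrow, so they are fused into the same (ShuffleMap) Stage.
val pairs = words.map(w => (w, 1)).filter { case (w, _) => w != "c" }

// reduceByKey introduces a shuffle dependency, so its reduce side runs in a new Stage.
val counts = pairs.reduceByKey(_ + _)

// The lineage shows a ShuffledRDD; the indentation change marks the Stage boundary.
println(counts.toDebugString)

// Nothing runs until an Action is called.
counts.collect().foreach(println)
```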
NEXT STAGE
- The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using the Block Manager.
- The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
Stage boundaries mean writing to disk, even if enough memory is present. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence - there is less re-computation work.
Similar aspects apply to DataFrames (DFs).
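For completeness, a hedged DataFrame sketch - column names and values are invented - showing that a groupBy aggregation is a wide transformation that shuffles into spark.sql.shuffle.partitions partitions:

```scala
// Hedged DataFrame sketch: column names and values are invented.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("df-shuffle-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// groupBy + agg is a wide transformation: an Exchange (shuffle) appears in the plan.
val agg = df.groupBy("key").agg(sum("value").as("total"))

agg.explain()                      // look for an Exchange / hashpartitioning node
println(agg.rdd.getNumPartitions)  // spark.sql.shuffle.partitions (200 by default),
                                   // unless adaptive execution coalesces the partitions
```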
Upvotes: 1