St.Antario

Reputation: 27385

Understanding shuffle in spark

I'm learning Spark and have a question about job scheduling and shuffle dependencies. Here is the DAG I'm looking at:

[DAG visualization: Stages 30, 31 and 32 (parallelize) feeding into Stage 33]

As we can see, Stage 33 contains multiple operations: groupBy, join, groupBy, join. What I don't quite understand is why the two groupBy operations were put into the same stage. I thought groupBy requires shuffling, so the DAGScheduler should have split Stage 33 into two stages, each containing a single groupBy and join.
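For context, I don't have the exact code behind the screenshot in front of me, but a pipeline shaped like the sketch below (made-up data, three parallelize inputs) produces a DAG of the same form, with both groupBys and both joins ending up in one final stage:

import org.apache.spark.{SparkConf, SparkContext}

object DagShapeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-shape-sketch").setMaster("local[*]"))

    val rdd1 = sc.parallelize(1 to 100)                       // one stage (like Stage 30)
    val rdd2 = sc.parallelize(1 to 100)                       // another stage (like Stage 31)
    val rdd3 = sc.parallelize(1 to 100).map(i => (i % 10, i)) // another stage (like Stage 32)

    // Each input is shuffled exactly once, on the boundary between its
    // parallelize stage and the final stage. The two shuffled groupBy outputs
    // share the same hash partitioner, so the joins add no extra stage: all of
    // groupBy, join, groupBy, join show up in the one final stage of the DAG.
    val result = rdd1.groupBy(_ % 10)
      .join(rdd2.groupBy(_ % 10))
      .join(rdd3)

    result.collect()
    sc.stop()
  }
}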

Upvotes: 3

Views: 3626

Answers (2)

user6860682

Reputation:

Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not involve moving data across JVM processes or even over the wire (between executors on separate machines).

In your case, the shuffles happen between the parallelize steps (Stages 30, 31 and 32) as inputs and the final Stage 33 as the destination.

Avoid shuffling at all costs. Think about ways to leverage existing partitions or to use broadcast variables, and try to reduce data transfer. You can read more about shuffling in Spark here.
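For instance, here is a rough sketch (hypothetical data and names) of replacing a shuffling join with a broadcast-variable lookup:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[*]"))

    val large = sc.parallelize(1 to 1000000).map(i => (i % 100, i)) // big pair RDD
    val small = (0 until 100).map(k => k -> s"label-$k").toMap      // small lookup table

    // Ship the small table to every executor once instead of shuffling `large`
    // to meet it, as large.join(smallRdd) would do.
    val smallBc = sc.broadcast(small)

    // Map-side join: look each key up in the broadcast map locally.
    val joined = large.flatMap { case (k, v) =>
      smallBc.value.get(k).map(label => (k, (v, label)))
    }

    println(joined.count())
    sc.stop()
  }
}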

Upvotes: 1

user8542872

Reputation: 41

Shuffles here are represented as boundaries between stages:

  • Stage 30 - Stage 33
  • Stage 31 - Stage 33
  • Stage 32 - Stage 33

Once the data has been shuffled, and all shuffled RDDs use the same partitioner, the final join is a one-to-one (narrow) dependency (and, if all parts were executed in the same action, it is also local due to co-location), so it doesn't require an additional shuffle stage.
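A minimal sketch of that situation (made-up data and partition count): both inputs are grouped with the same HashPartitioner, so the join that follows reuses the existing partitioning and introduces no extra shuffle stage.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("co-partitioned-join-sketch").setMaster("local[*]"))

    val part = new HashPartitioner(4)

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
    val right = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))

    // Each groupByKey shuffles its input into the 4 partitions defined by `part`.
    val groupedLeft  = left.groupByKey(part)
    val groupedRight = right.groupByKey(part)

    // Both sides already share `part`, so the join is a one-to-one (narrow)
    // dependency on them and needs no additional shuffle stage.
    val joined = groupedLeft.join(groupedRight)

    println(joined.toDebugString) // no new ShuffledRDD is introduced by the join
    joined.collect().foreach(println)
    sc.stop()
  }
}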

Upvotes: 4
