
Reputation: 395

spark shuffle partitions to disk

Let's assume I have two PySpark dataframes, each with three partitions:

df1=[[1,2,3],[3,2,1],[2,3,1]]
df2=[[3,2,1],[2,3,1],[1,2,3]]
df1.join(df2, "id").groupBy("id").count()

I am performing a join and a group-by, which means the job will have multiple stages.

After the first stage, 200 shuffle partitions will be created (the default for spark.sql.shuffle.partitions). In my example only 3 of them will contain data and the rest will be empty.

The shuffle partitions look like this:

partition1 :[1,1,1]
partition2 :[2,2,2]
partition3 :[3,3,3]

Do these shuffle partitions need to be written to the executors' disks? If so, doesn't that mean Spark is not purely in-memory computation? Why does it need to write the shuffle partitions to disk? Does stage 2 (the group-by) reuse the shuffle partitions from stage 1?
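The partition layout above can be sketched in plain Python. This is a simplified model, not Spark's actual partitioner (Spark uses Murmur3 hashing internally; Python's built-in `hash()` stands in here), but it shows why only 3 of the 200 shuffle partitions end up non-empty:

```python
NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions

def shuffle_partition(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Assign a key to a shuffle partition: hash the key modulo the partition count."""
    return hash(key) % num_partitions

# The nine values from df1/df2 in the example: three distinct keys.
keys = [1, 2, 3, 3, 2, 1, 2, 3, 1]

partitions = {}
for k in keys:
    partitions.setdefault(shuffle_partition(k), []).append(k)

# Three distinct keys hash to three partitions; the other 197 stay empty.
print(len(partitions))  # 3 non-empty shuffle partitions
```

With only three distinct keys, at most three of the 200 shuffle partitions receive any rows, which matches the partition1/partition2/partition3 picture above.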

Upvotes: 1

Views: 331

Answers (1)

Robert Kossendey

Reputation: 6998

My initial answer was incorrect. I misread your question, apologies.

You are correct: Spark writes shuffle results to disk. Spark computes in memory but stores intermediate shuffle output on disk. It does this because shuffling is very expensive, and shuffle results that have been persisted on disk can be reused in later stages instead of being recomputed.
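Both the number of shuffle partitions and where the shuffle files land are configurable. A minimal sketch, assuming a running SparkSession named `spark` (as in the pyspark shell); the values are illustrative, not recommendations:

```python
# Assumes a SparkSession named `spark` already exists.
# Lower the shuffle partition count so a 3-key job doesn't create
# ~197 empty shuffle partitions (default is 200):
spark.conf.set("spark.sql.shuffle.partitions", "3")

# The shuffle files themselves are written under spark.local.dir on
# each executor. That setting must be fixed before the application
# starts, e.g. via spark-submit:
#   --conf spark.local.dir=/tmp/spark-scratch
```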

This is an example where you can leverage that behaviour:

df \
   .join(df2, ["id"]) \
   .join(df3, ["id"]) \
   .join(df4, ["id2"])

is faster than

df \
   .join(df2, ["id"]) \
   .join(df3, ["id2"]) \
   .join(df4, ["id"])
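The intuition behind this ordering can be sketched in plain Python. This is a deliberately simplified model of Spark's planner (an assumption for illustration, not Spark's real logic): it counts one shuffle every time a join key differs from the key the data is currently partitioned on, and no shuffle when the existing partitioning can be reused.

```python
def count_shuffles(join_keys):
    """Count shuffles in a chain of joins, assuming a shuffle is needed
    only when the join key differs from the current partitioning key."""
    shuffles = 0
    current_partitioning = None  # key the data is currently hash-partitioned on
    for key in join_keys:
        if key != current_partitioning:
            shuffles += 1  # Exchange: repartition on the new join key
            current_partitioning = key
        # else: existing shuffle output on disk is reused, no new shuffle
    return shuffles

print(count_shuffles(["id", "id", "id2"]))  # 2 shuffles
print(count_shuffles(["id", "id2", "id"]))  # 3 shuffles
```

Grouping the two `"id"` joins together lets the second one reuse the shuffle output of the first, while interleaving `"id2"` between them forces an extra repartition on `"id"`.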

Upvotes: -1
