Reputation: 395
Let us assume I have two PySpark dataframes, each with three partitions:
df1 = [[1,2,3],[3,2,1],[2,3,1]]
df2 = [[3,2,1],[2,3,1],[1,2,3]]
df1.join(df2, "id").groupby("id").count()
I am performing a join followed by a group by, so the query can have two shuffle stages.
After the first stage, 200 shuffle partitions will be created (the default value of spark.sql.shuffle.partitions); in my example only 3 of them hold data and the rest are empty.
The shuffle partitions look like this:
partition1: [1,1,1]
partition2: [2,2,2]
partition3: [3,3,3]
Do these shuffle partitions need to be written to the executors' disks? Does that mean Spark is not purely in-memory computation? Why does it need to write the shuffle partitions to disk? Does it reuse the stage-1 shuffle partitions in stage 2 (the group by)?
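The key-to-partition mapping described above can be sketched outside Spark. This is a minimal illustration using Python's built-in hash, not Spark's actual partitioner (Spark hashes the key with Murmur3 before taking the modulo), but the modulo step is the same idea:

```python
# Minimal sketch of hash partitioning: each row goes to partition
# hash(key) % num_partitions. With only 3 distinct join keys and
# 200 shuffle partitions, at most 3 partitions end up non-empty.
from collections import defaultdict

def partition_rows(keys, num_partitions=200):
    parts = defaultdict(list)
    for k in keys:
        parts[hash(k) % num_partitions].append(k)
    return parts

# Three distinct keys coming from both sides of the join
parts = partition_rows([1, 2, 3, 3, 2, 1])
print(len(parts))  # at most 3 non-empty partitions; the other ~197 stay empty
```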
Upvotes: 1
Views: 331
Reputation: 6998
My initial answer was incorrect. I misread your question, apologies.
You are correct: Spark writes shuffle results to the executors' local disks. Spark computes in memory, but shuffle output (the intermediate files produced by the map side of a shuffle) is materialized on disk. It does this because shuffling is very expensive, and persisting the shuffle files lets later stages read them again, and lets failed or reused stages be served from the existing files, instead of recomputing the shuffle.
This is an example where you can leverage that behaviour:
df \
.join(df2, ["id"]) \
.join(df3, ["id"]) \
.join(df4, ["id2"])
is faster than
df \
.join(df2, ["id"]) \
.join(df3, ["id2"]) \
.join(df4, ["id"])
Upvotes: -1