Reputation: 395
Let us assume I have two PySpark dataframes, each with three partitions:
df1 = [[1,2,3],[3,2,1],[2,3,1]]
df2 = [[3,2,1],[2,3,1],[1,2,3]]
df1.join(df2, "id").groupby("id").count()
I am performing a join followed by a group by, so the query can have two shuffle stages.
After the first stage, 200 shuffle partitions will be created (the default value of spark.sql.shuffle.partitions); in my example only 3 of them hold data and the rest are empty.
The shuffle partitions look like this:
partition1: [1,1,1]
partition2: [2,2,2]
partition3: [3,3,3]
Do these shuffle partitions need to be written to the executors' disks? Does that mean Spark is not purely in-memory computation? Why does it need to write the shuffle partitions to disk? Does it reuse the stage-1 shuffle partitions in stage 2 (the group by)?
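The key-to-partition mapping described above can be sketched outside Spark. This is a minimal illustration using Python's built-in hash, not Spark's actual partitioner (Spark hashes the key with Murmur3 before taking the modulo), but the modulo step is the same idea:

```python
# Minimal sketch of hash partitioning: each row goes to partition
# hash(key) % num_partitions. With only 3 distinct join keys and
# 200 shuffle partitions, at most 3 partitions end up non-empty.
from collections import defaultdict

def partition_rows(keys, num_partitions=200):
    parts = defaultdict(list)
    for k in keys:
        parts[hash(k) % num_partitions].append(k)
    return parts

# Three distinct keys coming from both sides of the join
parts = partition_rows([1, 2, 3, 3, 2, 1])
print(len(parts))  # at most 3 non-empty partitions; the other ~197 stay empty
```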
Upvotes: 1
Views: 331
Reputation: 6998
My initial answer was incorrect. I misread your question, apologies.
You are correct: Spark writes shuffle results to the executors' local disks. Spark computes in memory, but shuffle output (the intermediate files produced by the map side of a shuffle) is materialized on disk. It does this because shuffling is very expensive, and persisting the shuffle files lets later stages read them again, and lets failed or reused stages be served from the existing files, instead of recomputing the shuffle.
This is an example where you can leverage that behaviour:
df \
.join(df2, ["id"]) \
.join(df3, ["id"]) \
.join(df4, ["id2"])
is faster than
df \
.join(df2, ["id"]) \
.join(df3, ["id2"]) \
.join(df4, ["id"])
Upvotes: -1