Reputation: 41
Is there some way to get deterministic results from dataframe repartition without sorting? In the below code, I get different results while doing the same operation.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the PySpark shell
df = sqlContext.range(0, 100000)
# repartition dataframe to 5 partitions
df2 = df.repartition(5).persist()
df2.head(5)
Out[1]: [Row(id=5324), Row(id=5389), Row(id=6209), Row(id=7640), Row(id=8090)]
df2.unpersist()
df3 = df.repartition(5).persist()
df3.head(5)
Out[2]: [Row(id=1019), Row(id=652), Row(id=2287), Row(id=470), Row(id=1348)]
Spark Version - 2.4.5
Upvotes: 1
Views: 1808
Reputation: 9417
According to this JIRA, repartition by default performs a local sort before partitioning, which makes it fully deterministic. From the PR notes:
In this PR, we propose ... performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named
spark.sql.execution.sortBeforeRepartition
to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
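As a sketch (assuming a live session; in the PySpark shell a SparkSession named spark already exists), the flag quoted above can be toggled like this:

```python
from pyspark.sql import SparkSession

# Build or reuse a session; in the PySpark shell `spark` is predefined.
spark = SparkSession.builder.getOrCreate()

# Default is "true" (safe-by-default). Setting it to "false" skips the extra
# local sort, trading reproducibility on task retry for speed.
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "false")
```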
head(n)
on the other hand is not deterministic (unless you apply orderBy
first, which again repartitions the dataset to a single partition), but that isn't your concern here, right?
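The point can be sketched without Spark at all (plain Python, with hypothetical helper names): taking the first n rows of an arbitrarily shuffled collection varies between runs, while sorting before taking them does not:

```python
import random

rows = list(range(100))

def head_unordered(rows, n):
    """Stand-in for head(n) on a shuffled/repartitioned dataset."""
    shuffled = rows[:]
    random.shuffle(shuffled)       # nondeterministic physical layout
    return shuffled[:n]

def head_ordered(rows, n):
    """Stand-in for orderBy(...).head(n): sort first, then take n."""
    shuffled = rows[:]
    random.shuffle(shuffled)
    return sorted(shuffled)[:n]    # the sort erases the layout differences

# head_unordered(rows, 5) differs run to run;
# head_ordered(rows, 5) is always [0, 1, 2, 3, 4]
```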
Upvotes: 1
Reputation: 451
This non-deterministic behaviour is expected. Here's how:
.repartition(num)
does a round-robin repartitioning when no columns are passed to the function. This does not guarantee that a specific row will always land in a particular partition.
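A minimal pure-Python sketch of round-robin dealing (assumption: each task starts dealing at a random offset, which is what makes the placement non-reproducible from run to run):

```python
import random

def round_robin(rows, num_partitions, start=None):
    """Deal rows across partitions one at a time, starting at `start`."""
    if start is None:
        start = random.randrange(num_partitions)  # varies per run/task
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[(start + i) % num_partitions].append(row)
    return partitions

# With start=0, partition 0 holds rows 0 and 5; with start=2 it holds 3 and 8,
# so "the first partition" is not a stable set of rows.
print(round_robin(range(10), 5, start=0)[0])  # [0, 5]
print(round_robin(range(10), 5, start=2)[0])  # [3, 8]
```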
.head(n)
returns the first n rows from the first partition of the dataframe.
If you want a stable order, you need to use orderBy!
Upvotes: 1