user1808602
user1808602

Reputation: 41

Spark non-deterministic results after repartition

Is there some way to get deterministic results from dataframe repartition without sorting? In the below code, I get different results while doing the same operation.

from pyspark.sql.functions import rand, randn
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 100000)

# repartition dataframe to 5 partitions
df2 = df.repartition(5).persist()
df2.head(5)

Out[1]: [Row(id=5324), Row(id=5389), Row(id=6209), Row(id=7640), Row(id=8090)]

df2.unpersist()
df3 = df.repartition(5).persist()
df3.head(5)

Out[2]: [Row(id=1019), Row(id=652), Row(id=2287), Row(id=470), Row(id=1348)]

Spark Version - 2.4.5

Upvotes: 1

Views: 1808

Answers (2)

mazaneicha
mazaneicha

Reputation: 9417

According to this JIRA, repartitioning (by default) involves a local sort, and is fully deterministic. From the PR notes:

In this PR, we propose ... performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.

The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named spark.sql.execution.sortBeforeRepartition to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.

head(n) on the other hand is not (unless you apply orderBy which again repartitions dataset to one partition), but that isn't your concern right?

Upvotes: 1

Samir Vyas
Samir Vyas

Reputation: 451

This non deterministic behaviour is expected. Here's how...

  1. .repartition(num) does a round-robin repartitioning when no columns are passed inside the function. This does not guarantee that a specific row will always be in a particular partition.

  2. .head(n) returns first n rows of first partition of dataframe.

If you want an order, you need to use orderBy !

Upvotes: 1

Related Questions