xiexieni9527

Reputation: 131

Applying PySpark dropDuplicates method messes up the sorting of the data frame

I'm not sure why this happens, but when I apply dropDuplicates to a sorted data frame, the sort order is lost. Compare the following two tables.

The following table is the output of sorted_df.show(), in which the sorting is in order.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+

The following table is the output of sorted_df.dropDuplicates().show(). The rows are no longer in sorted order, even though it's the same data frame.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+

Can someone explain why this happens, and how I can keep the same sort order with dropDuplicates applied?

Apache Spark version 3.1.2

Upvotes: 0

Views: 292

Answers (1)

Ged

Reputation: 18098

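dropDuplicates involves a shuffle: Spark hash-partitions the rows so that identical rows end up in the same partition, and the order produced by the earlier sort is not preserved across that shuffle. Ordering is therefore disrupted. To get sorted output, apply the sort after dropDuplicates rather than before it.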

Upvotes: 2
