I'm not sure why this happens, but when I apply dropDuplicates to a sorted data frame, the sort order is lost. Compare the following two tables.
The following table is the output of sorted_df.show(), in which the rows are in sorted order.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+
The following table is the output of sorted_df.dropDuplicates().show(). The rows are no longer in sorted order, even though it's the same data frame.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+
Can someone explain why this happens, and how I can keep the same sort order with dropDuplicates applied?
Apache Spark version 3.1.2
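For anyone who wants to try this, here is a minimal sketch of a reproduction. How sorted_df was originally built is not shown in this post, so everything here is an assumption for illustration: the names ROWS, COLUMNS and reproduce are hypothetical, the frame is built with createDataFrame from the ten rows in the first table, it is sorted with orderBy on sorted_col, and a local SparkSession is used. The row order printed by the second show() may differ between runs and setups.

```python
# Rows copied from the first table in the question, ordered by sorted_col.
ROWS = [
    (1, 1), (8, 5), (15, 1), (19, 9), (20, 7),
    (27, 9), (67, 8), (91, 9), (91, 7), (91, 1),
]
COLUMNS = ["sorted_col", "another_col"]


def reproduce():
    """Build the frame, sort it, and show it before and after dropDuplicates."""
    # Imported here so the data above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local session; the original cluster setup is unknown.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("dropDuplicates-order")
        .getOrCreate()
    )

    # Assumption: the original frame was sorted on sorted_col with orderBy.
    sorted_df = spark.createDataFrame(ROWS, COLUMNS).orderBy("sorted_col")

    sorted_df.show()                   # rows appear ordered by sorted_col
    sorted_df.dropDuplicates().show()  # row order is not guaranteed here

    spark.stop()


if __name__ == "__main__":
    reproduce()
```

Note that the ten rows are all distinct, so dropDuplicates does not remove anything in this example; only the order of the output changes.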