Reputation: 1750
I have a use case where I need to deduplicate a dataframe on a column (a GUID column). But instead of dropping the duplicates, I need to store them in a separate location. For example, given the following data with schema (name, GUID):
(a, 1), (b, 2), (a, 2), (a, 3), (c, 1), (c, 4)
I want to split the dataset so that I have:
(a, 1), (b, 2), (a, 3), (c, 4)
in one part and
(a, 2), (c, 1)
in the second part. If I use dropDuplicates(["GUID"]), the second part gets lost. What would be an efficient way to do this?
Upvotes: 1
Views: 206
Reputation: 42422
You can assign a row number within each GUID group and split the dataframe into two parts based on whether that row number is equal to 1.
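For illustration, here is a minimal sketch that recreates the example dataframe from the question (it assumes an active SparkSession named spark; the column names are taken from the schema in the question):
df = spark.createDataFrame(
    [('a', 1), ('b', 2), ('a', 2), ('a', 3), ('c', 1), ('c', 4)],
    ['name', 'GUID']
)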
from pyspark.sql import functions as F, Window

# Number the rows within each GUID; ordering by monotonically_increasing_id()
# means the earliest occurrence of each GUID gets rn = 1
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('GUID').orderBy(F.monotonically_increasing_id()))
)
df2.show()
+----+----+---+
|name|GUID| rn|
+----+----+---+
| a| 1| 1|
| c| 1| 2|
| a| 3| 1|
| b| 2| 1|
| a| 2| 2|
| c| 4| 1|
+----+----+---+
# rows to keep: the first occurrence of each GUID
df2_part1 = df2.filter('rn = 1').drop('rn')
# rows to set aside: every later occurrence of a GUID
df2_part2 = df2.filter('rn != 1').drop('rn')
df2_part1.show()
+----+----+
|name|GUID|
+----+----+
| a| 1|
| a| 3|
| b| 2|
| c| 4|
+----+----+
df2_part2.show()
+----+----+
|name|GUID|
+----+----+
| c| 1|
| a| 2|
+----+----+
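Since the goal is to keep the duplicates rather than discard them, each part can then be written to its own location. A minimal sketch, where the output paths and the parquet format are only placeholders for whatever storage you actually use:
# first occurrences, i.e. the deduplicated data
df2_part1.write.mode('overwrite').parquet('/path/to/deduplicated')
# the duplicates that dropDuplicates would have thrown away
df2_part2.write.mode('overwrite').parquet('/path/to/duplicates')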
Upvotes: 2