Dexter

Reputation: 1750

Spark split dataframe based on logic

I have a use case where I need to deduplicate a dataframe on a column (a GUID column). Instead of discarding the duplicates, I need to store them in a separate location. For example, given the following data with schema (name, GUID): (a, 1), (b, 2), (a, 2), (a, 3), (c, 1), (c, 4), I want to split the dataset so that (a, 1), (b, 2), (a, 3), (c, 4) end up in one part and (a, 2), (c, 1) in the other. If I use dropDuplicates(col("GUID")), the second part gets lost. What would be an efficient way to do this?
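For reference, the sample data above can be reproduced with something like the following (a minimal sketch; the column names name and GUID follow the schema given in the question, and the SparkSession setup is assumed):

from pyspark.sql import SparkSession

# Build the sample dataframe shown in the question.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1), ('b', 2), ('a', 2), ('a', 3), ('c', 1), ('c', 4)],
    ['name', 'GUID'],
)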

Upvotes: 1

Views: 206

Answers (1)

mck

Reputation: 42422

You can assign a row number within each GUID group and split the dataframe into two parts based on whether the row number equals 1.

from pyspark.sql import functions as F, Window

# Number the rows within each GUID group; ordering by monotonically_increasing_id()
# makes (roughly) the first occurrence of each GUID get rn = 1.
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('GUID').orderBy(F.monotonically_increasing_id()))
)
df2.show()
+----+----+---+
|name|GUID| rn|
+----+----+---+
|   a|   1|  1|
|   c|   1|  2|
|   a|   3|  1|
|   b|   2|  1|
|   a|   2|  2|
|   c|   4|  1|
+----+----+---+

df2_part1 = df2.filter('rn = 1').drop('rn')
df2_part2 = df2.filter('rn != 1').drop('rn')

df2_part1.show()
+----+----+
|name|GUID|
+----+----+
|   a|   1|
|   a|   3|
|   b|   2|
|   c|   4|
+----+----+

df2_part2.show()
+----+----+
|name|GUID|
+----+----+
|   c|   1|
|   a|   2|
+----+----+
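If the goal is to persist the duplicates in a separate location, each part can then be written out on its own. Caching df2 before filtering avoids recomputing the window for both parts; the parquet paths below are only placeholders, not from the question.

# Optional: cache so the row_number window is not recomputed for each filter.
df2.cache()

# Write each part to its own location; the paths here are examples.
df2_part1.write.mode('overwrite').parquet('/path/to/deduplicated')
df2_part2.write.mode('overwrite').parquet('/path/to/duplicates')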

Upvotes: 2
