nbs335

Reputation: 9

Pyspark drop duplicates when a column is null

I have a dataset like this :

id  name
1   A
2   null
2   B
3   C
3   null
4   null

Where an id appears more than once, I want to keep only the row with a non-null name.

In this example, I want to get this table :

id  name
1   A
2   B
3   C
4   null

I tried the df.dropDuplicates() function, but I don't know how to tell it to keep the non-null value in the "name" column.

Upvotes: 0

Views: 47

Answers (1)

s.polam

Reputation: 10382

Here is an approach using a window function combined with distinct to eliminate duplicates.

from pyspark.sql.functions import expr

(
    df
    .withColumn(
        "name",
        # FIRST(name, TRUE) skips nulls within each id partition
        expr("FIRST(name, TRUE) OVER (PARTITION BY id ORDER BY id)")
    )
    .distinct()
    .show(10, False)
)

OR using groupBy

from pyspark.sql.functions import first

(
    df
    .groupBy("id")
    .agg(first("name", ignorenulls=True).alias("name"))
    .show(10, False)
)

OR using dropDuplicates

(
    df
    .orderBy("name", ascending=False)  # descending sort puts nulls last
    .dropDuplicates(["id"])
    .show(10, False)
)
+---+----+
|id |name|
+---+----+
|1  |A   |
|2  |B   |
|3  |C   |
|4  |NULL|
+---+----+
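To make the "first non-null per id" semantics of FIRST(name, TRUE) concrete, here is the same logic sketched in plain Python (the helper name is hypothetical, purely for illustration, and not part of Spark):

```python
# Illustrative plain-Python equivalent of Spark SQL's FIRST(name, TRUE)
# grouped by id: keep the first non-null name per id, or None if all
# names for that id are null.
def first_non_null_per_id(rows):
    result = {}
    for id_, name in rows:
        if id_ not in result or result[id_] is None:
            # Take the name if it is non-null; otherwise keep whatever
            # we had (None if this id is new).
            result[id_] = name if name is not None else result.get(id_)
    return sorted(result.items())

rows = [(1, "A"), (2, None), (2, "B"), (3, "C"), (3, None), (4, None)]
print(first_non_null_per_id(rows))
# [(1, 'A'), (2, 'B'), (3, 'C'), (4, None)]
```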

Upvotes: 0

Related Questions