nbs335

Reputation: 9

Pyspark drop duplicates when a column is null

I have a dataset like this :

id  name
1   A
2   null
2   B
3   C
3   null
4   null

Where an id appears more than once, I want to keep only the row with a non-null name.

In this example, I want to get this table :

id  name
1   A
2   B
3   C
4   null

I tried the df.dropDuplicates() function, but I don't know how to tell it to keep the non-null value in the "name" column.

Upvotes: 0

Views: 47

Answers (1)

s.polam

Reputation: 10382

Here is an approach using a window function combined with distinct to eliminate duplicates.

from pyspark.sql.functions import expr

(
    df
    .withColumn(
        "name",
        # FIRST(name, TRUE) skips nulls within each id partition
        expr("FIRST(name, TRUE) OVER (PARTITION BY id ORDER BY id)")
    )
    .distinct()
    .show(10, False)
)

OR using groupBy

from pyspark.sql.functions import first

(
    df
    .groupBy("id")
    .agg(first("name", ignorenulls=True).alias("name"))
    .show(10, False)
)

OR using dropDuplicates

(
    df
    .orderBy("name", ascending=False)  # descending sort puts nulls last
    .dropDuplicates(["id"])
    .show(10, False)
)
+---+----+
|id |name|
+---+----+
|1  |A   |
|2  |B   |
|3  |C   |
|4  |NULL|
+---+----+
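To make the "first non-null per id" semantics of FIRST(name, TRUE) concrete, here is the same logic sketched in plain Python (the helper name is hypothetical, purely for illustration, and not part of Spark):

```python
# Illustrative plain-Python equivalent of Spark SQL's FIRST(name, TRUE)
# grouped by id: keep the first non-null name per id, or None if all
# names for that id are null.
def first_non_null_per_id(rows):
    result = {}
    for id_, name in rows:
        if id_ not in result or result[id_] is None:
            # Take the name if it is non-null; otherwise keep whatever
            # we had (None if this id is new).
            result[id_] = name if name is not None else result.get(id_)
    return sorted(result.items())

rows = [(1, "A"), (2, None), (2, "B"), (3, "C"), (3, None), (4, None)]
print(first_non_null_per_id(rows))
# [(1, 'A'), (2, 'B'), (3, 'C'), (4, None)]
```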

Upvotes: 0

Related Questions