Reputation: 9
I have a dataset like this :
id | name |
---|---|
1 | A |
2 | null |
2 | B |
3 | C |
3 | null |
4 | null |
In case of duplicate IDs, I want to keep the only value with a not null name
In this example, I want to get this table :
id | name |
---|---|
1 | A |
2 | B |
3 | C |
4 | null |
I tried to use the df.dropDuplicates() function, but I don't know how to specify to keep non-null value in "name" column
Upvotes: 0
Views: 47
Reputation: 10382
Here is an approach using window
& distinct
functions to eliminate duplicates.
df
.withColumn(
"name",
expr("FIRST(name, TRUE) OVER(PARTITION BY id ORDER BY id)")
)
.distinct
.show(10, False)
OR using groupBy
df
.groupBy("id")
.agg(first("name", true).as("name"))
.show(10, False)
OR using dropDuplicates
df
.orderBy("name", ascending=False)
.dropDuplicates(["id"])
.show(10, False)
+---+----+
|id |name|
+---+----+
|1 |A |
|2 |B |
|3 |C |
|4 |NULL|
+---+----+
Upvotes: 0