bob

Reputation: 79

How to write a condition in PySpark that keeps a null Score only when it is the only row for an ID, and removes nulls otherwise

Condition: keep an ID's null Score row only when that ID has no non-null scores; otherwise drop its null rows.

Input:

ID Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:

ID Score
AAA High
AAA Mid
BBB None

I'm having difficulty writing this condition in PySpark. Is there another way to tackle this problem?

Upvotes: 1

Views: 267
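
For reference, both answers below assume a DataFrame named df holding the sample input. A minimal sketch of building it (assuming a local SparkSession) might look like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None represents a null Score
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)
df.show()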

Answers (2)

blackbishop

Reputation: 32640

You can count Score over a window partitioned by ID, then keep the rows where Score is not null or where the count is 0:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "count_scores",
    # count() skips nulls, so this is the number of non-null scores per ID
    F.count("Score").over(Window.partitionBy("ID"))
).where("Score IS NOT NULL OR count_scores = 0")\
 .drop("count_scores")

df1.show()

#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA|  Mid|
#+---+-----+

Upvotes: 3

mck

Reputation: 42332

You can add a flag indicating whether all scores for an ID are null, then keep the rows where the score is not null or where the flag is true (all scores are null):

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag',
    # min over booleans is True only when every Score for the ID is null
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')

df2.show()
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA|  Mid|
+---+-----+

Upvotes: 3
