bob

Reputation: 79

How to write a condition in PySpark that keeps a null Score only when it is the only row for an ID, and removes nulls otherwise

Condition: keep an ID's null Score row only when that ID has no non-null scores; otherwise drop its null rows.

Input:

ID Score
AAA High
AAA Mid
AAA None
BBB None

Desired output:

ID Score
AAA High
AAA Mid
BBB None

I'm having difficulty writing this condition in PySpark. Is there another way to tackle this problem?

Upvotes: 1

Views: 267
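
For reference, both answers below assume a DataFrame named df holding the sample input. A minimal sketch of building it (assuming a local SparkSession) might look like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None represents a null Score
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)
df.show()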

Answers (2)

blackbishop

Reputation: 32640

You can count Score over a window partitioned by ID, then keep the rows where Score is not null or where the count is 0:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "count_scores",
    # count() skips nulls, so this is the number of non-null scores per ID
    F.count("Score").over(Window.partitionBy("ID"))
).where("Score IS NOT NULL OR count_scores = 0")\
 .drop("count_scores")

df1.show()

#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA|  Mid|
#+---+-----+

Upvotes: 3

mck

Reputation: 42332

You can add a flag indicating whether all scores for an ID are null, then keep the rows where the score is not null or where the flag is true (all scores are null):

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag',
    # min over booleans is True only when every Score for the ID is null
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')

df2.show()
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA|  Mid|
+---+-----+

Upvotes: 3
