Reputation: 79
Condition: drop every row whose Score is null, except when all rows for that ID have a null Score (then keep them).
Input:
ID  | Score
----|------
AAA | High
AAA | Mid
AAA | None
BBB | None
Desired output:
ID  | Score
----|------
AAA | High
AAA | Mid
BBB | None
I'm having difficulty writing this condition in PySpark. Is there another way to tackle this problem?
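For reference, a minimal sketch that rebuilds the sample input (assuming an active SparkSession named spark, which is illustrative; Python None stands in for the null Score):

from pyspark.sql import SparkSession

# Hypothetical session setup; adjust to your environment.
spark = SparkSession.builder.getOrCreate()

# Recreate the sample input; None becomes a null Score.
df = spark.createDataFrame(
    [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)],
    ["ID", "Score"],
)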
Upvotes: 1
Views: 267
Reputation: 32640
You can count Score over a window partitioned by ID, then filter on rows where Score is not null or the count is 0. Since F.count skips nulls, a count of 0 means every Score for that ID is null:
from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "count_scores",
    # number of non-null scores per ID (F.count skips nulls)
    F.count("Score").over(Window.partitionBy("ID"))
).where(
    "Score IS NOT NULL OR count_scores = 0"
).drop("count_scores")

df1.show()
df1.show()
#+---+-----+
#| ID|Score|
#+---+-----+
#|BBB| null|
#|AAA| High|
#|AAA| Mid|
#+---+-----+
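If you prefer column expressions to a SQL string, the same filter can be sketched as follows (equivalent logic, just a different API style):

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "count_scores",
    F.count("Score").over(Window.partitionBy("ID"))
).where(
    # keep non-null scores, or IDs with no non-null score at all
    F.col("Score").isNotNull() | (F.col("count_scores") == 0)
).drop("count_scores")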
Upvotes: 3
Reputation: 42332
You can add a flag for whether all scores of an ID are null, then keep the rows where the score is not null or the flag is true (all scores are null). Booleans sort with false before true, so F.min of Score.isNull() is true only when every row in the ID partition has a null Score:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag',
    # true only when Score is null on every row of the ID partition
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')
df2.show()
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA| Mid|
+---+-----+
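As a sanity check on the boolean aggregation, an equivalent sketch (a hypothetical variant, not from the answer) negates F.max of isNotNull instead, since both reduce to "no non-null Score exists in this partition":

from pyspark.sql import functions as F, Window

df2b = df.withColumn(
    'flag',
    # true when no row in the ID partition has a non-null Score
    ~F.max(F.col('Score').isNotNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')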
Upvotes: 3