Reputation: 473
Say I have a dataset like the one below. I want to group by ID and, if all of the values in a group are NULL or their sum is not greater than 0, create a new column that marks each row True or False.
The desired result is shown after it.
ID  VALUE
1   6.00
1   5.00
2   NULL
2   NULL
2   NULL
3   3.00
ID  VALUE  ALL_NULL
1   6.00   False
1   5.00   False
2   NULL   True
2   NULL   True
2   NULL   True
3   3.00   False
Here is what I've tried so far.
df = (
    wbcl_full
    .withColumn('ALL_NULL', F.when((wbcl_full.groupBy('ID').agg(sum('VALUE').collect()[0][0]) == 0), True)
                             .otherwise(False))
)
return df
I appreciate your help! More generally, I have yet to find a good way to do conditional window functions: in some cases, depending on the grouped results, I'd like to selectively pick which row is chosen when windowing.
Upvotes: 0
Views: 761
Reputation: 42392
You can do a sum over a window partitioned by ID, and check if the sum is null (which happens only if all elements are null for that ID) or if it is not greater than 0:
from pyspark.sql import functions as F, Window
# Compute the per-ID sum once; it is NULL only when every VALUE in the group is NULL.
group_sum = F.sum('VALUE').over(Window.partitionBy('ID'))

result = df.withColumn('ALL_NULL', group_sum.isNull() | (group_sum <= 0))
result.show()
+---+-----+--------+
| ID|VALUE|ALL_NULL|
+---+-----+--------+
| 1| 6.0| false|
| 1| 5.0| false|
| 3| 3.0| false|
| 2| null| true|
| 2| null| true|
| 2| null| true|
+---+-----+--------+
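If you want to sanity-check the grouped logic outside Spark, the same rule (flag every row of a group whose sum is missing, i.e. all values are NULL, or not greater than 0) can be sketched in pandas. This is just an illustration with the example data above, not part of the Spark solution; note that pandas' default sum of an all-NaN group is 0, so `min_count=1` is needed to reproduce Spark's NULL semantics:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3],
    'VALUE': [6.0, 5.0, None, None, None, 3.0],
})

# Per-ID sum broadcast back to every row; min_count=1 makes an
# all-NaN group sum to NaN (mirroring Spark's SUM, which is NULL
# when all inputs are NULL).
group_sum = df.groupby('ID')['VALUE'].transform(lambda s: s.sum(min_count=1))

# Flag a row when its group's sum is missing or not greater than 0.
# (NaN <= 0 evaluates to False, so the isna() check carries those rows.)
df['ALL_NULL'] = group_sum.isna() | (group_sum <= 0)
```

Running this gives the same flags as the Spark output: False for IDs 1 and 3, True for every row of ID 2.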
Upvotes: 1