Reputation: 473
Say I have a dataset like the one below. I want to group by ID and, if all of the values in a group are NULL or their sum is not greater than 0, create a new column that marks each row True or False.
The desired result is shown after it.
ID  VALUE
1   6.00
1   5.00
2   NULL
2   NULL
2   NULL
3   3.00
ID  VALUE  ALL_NULL
1   6.00   False
1   5.00   False
2   NULL   True
2   NULL   True
2   NULL   True
3   3.00   False
Here is what I've tried so far.
df = (
    wbcl_full
    .withColumn('ALL_NULL', F.when((wbcl_full.groupBy('ID').agg(sum('VALUE').collect()[0][0]) == 0), True)
                             .otherwise(False))
)
return df
I appreciate your help! More generally, I have yet to find a good way to do conditional window functions: in some cases, depending on the grouped results, I'd like to selectively pick which row is chosen when windowing.
Upvotes: 0
Views: 761
Reputation: 42392
You can do a sum over a window partitioned by ID, and check if the sum is null (which happens only if all elements are null for that ID) or if it is not greater than 0:
from pyspark.sql import functions as F, Window
# Compute the per-ID sum once; it is NULL only when every VALUE in the group is NULL.
group_sum = F.sum('VALUE').over(Window.partitionBy('ID'))

result = df.withColumn('ALL_NULL', group_sum.isNull() | (group_sum <= 0))
result.show()
+---+-----+--------+
| ID|VALUE|ALL_NULL|
+---+-----+--------+
| 1| 6.0| false|
| 1| 5.0| false|
| 3| 3.0| false|
| 2| null| true|
| 2| null| true|
| 2| null| true|
+---+-----+--------+
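If you want to sanity-check the grouped logic outside Spark, the same rule (flag every row of a group whose sum is missing, i.e. all values are NULL, or not greater than 0) can be sketched in pandas. This is just an illustration with the example data above, not part of the Spark solution; note that pandas' default sum of an all-NaN group is 0, so `min_count=1` is needed to reproduce Spark's NULL semantics:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3],
    'VALUE': [6.0, 5.0, None, None, None, 3.0],
})

# Per-ID sum broadcast back to every row; min_count=1 makes an
# all-NaN group sum to NaN (mirroring Spark's SUM, which is NULL
# when all inputs are NULL).
group_sum = df.groupby('ID')['VALUE'].transform(lambda s: s.sum(min_count=1))

# Flag a row when its group's sum is missing or not greater than 0.
# (NaN <= 0 evaluates to False, so the isna() check carries those rows.)
df['ALL_NULL'] = group_sum.isna() | (group_sum <= 0)
```

Running this gives the same flags as the Spark output: False for IDs 1 and 3, True for every row of ID 2.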
Upvotes: 1