Reputation: 33
I want to know how to create a new column on the full DataFrame, but whose values are based on only a subset of the DataFrame (i.e. some function applied based on a condition).
In this example, I want to create a column that associates each ID with the sum of only positive values in the Value column by ID.
Using this dataset:
+-----+-----+
| ID  |Value|
+-----+-----+
|  1  |  -4 |
|  2  |   5 |
|  2  | -23 |
|  1  |   5 |
|  2  |  19 |
|  1  |  12 |
+-----+-----+
I'm looking to get the following result:
+-----+-----+-------------+
| ID  |Value| sum_pos_val |
+-----+-----+-------------+
|  1  |  -4 |          17 |
|  2  |   5 |          24 |
|  2  | -23 |          24 |
|  1  |   5 |          17 |
|  2  |  19 |          24 |
|  1  |  12 |          17 |
+-----+-----+-------------+
The following code will sum the Value column by ID, but how can I restrict the sum to only the positive values?
import pyspark.sql.functions as f
from pyspark.sql import Window

# Window over all rows that share the same ID
w = Window.partitionBy('ID')
# Sums every Value per ID, negatives included
df.withColumn('sum_pos_val', f.sum('Value').over(w))\
    .show()
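For completeness, the sample DataFrame can be built with something like the following (the local SparkSession here is just an assumption to make the example reproducible):

from pyspark.sql import SparkSession

# Assumed local session, only used to recreate the sample data
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, -4), (2, 5), (2, -23), (1, 5), (2, 19), (1, 12)],
    ['ID', 'Value'],
)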
Upvotes: 0
Views: 697
Reputation: 11
Replace
f.sum('Value').over(w)
with
f.sum(
    f.when(f.col('Value') > 0, f.col('Value')).otherwise(0)
).over(w)
or even simpler (see spark dataframe groupping does not count nulls)
f.sum(
    f.when(f.col('Value') > 0, f.col('Value'))
).over(w)
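Putting it together on the sample data, a minimal end-to-end sketch (the SparkSession setup is assumed; the window expression is the one above):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, -4), (2, 5), (2, -23), (1, 5), (2, 19), (1, 12)],
    ['ID', 'Value'],
)
w = Window.partitionBy('ID')

# when() without otherwise() leaves non-positive rows as NULL,
# and sum() ignores NULLs, so only positive values enter the window sum
df.withColumn(
    'sum_pos_val',
    f.sum(f.when(f.col('Value') > 0, f.col('Value'))).over(w)
).show()

This gives sum_pos_val = 17 for ID 1 (5 + 12) and 24 for ID 2 (5 + 19), matching the expected output.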
Upvotes: 1