Reputation: 75
I am reviewing some code and would like a bit more clarity.
Here is my PySpark Dataframe:
| YEAR_A | YEAR_B | AMOUNT |
|--------|--------|--------|
| 2000   | 2001   | 5      |
| 2000   | 2000   | 4      |
| 2000   | 2001   | 3      |
I initiate a window function:
window = Window.partitionBy('YEAR_A')
Then I would like some help understanding the following part, especially what happens after the `.over(window)`:

df = (df.withColumn("newcolumn", F.sum("AMOUNT").over(window) * (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer")))
Is it supposed to add a "newcolumn" to my dataframe containing the sum of "AMOUNT" for the current YEAR_A, written only when "YEAR_A" equals "YEAR_B" (and nan otherwise)? Or am I missing something?
Upvotes: 0
Views: 131
Reputation: 24386
`(F.col("YEAR_B") == F.col("YEAR_A"))` compares the two columns. If the values in a row are equal, you get `True`; if they are not equal, you get `False`.

`.cast("integer")` turns that boolean into an integer: `True` becomes `1`, `False` becomes `0`.
`F.sum("AMOUNT").over(window) *` — you multiply the result of the window function by the result above. Multiplying by `1` keeps the value of the window function; multiplying by `0` gives `0`.
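To make the three steps concrete, here is a plain-Python sketch of what the expression computes on your sample data (plain Python rather than PySpark so it runs anywhere; the single-partition assumption holds because every row has YEAR_A = 2000):

```python
# Sample rows from the question's dataframe.
rows = [
    {"YEAR_A": 2000, "YEAR_B": 2001, "AMOUNT": 5},
    {"YEAR_A": 2000, "YEAR_B": 2000, "AMOUNT": 4},
    {"YEAR_A": 2000, "YEAR_B": 2001, "AMOUNT": 3},
]

# Step 1: the window sum of AMOUNT per YEAR_A partition
# (what F.sum("AMOUNT").over(window) produces for each row).
partition_sums = {}
for r in rows:
    partition_sums[r["YEAR_A"]] = partition_sums.get(r["YEAR_A"], 0) + r["AMOUNT"]

# Steps 2 and 3: cast the comparison to 0/1, then multiply.
newcolumn = [
    partition_sums[r["YEAR_A"]] * int(r["YEAR_B"] == r["YEAR_A"])
    for r in rows
]

print(newcolumn)  # [0, 12, 0]
```

Only the middle row has YEAR_A equal to YEAR_B, so it keeps the partition sum (5 + 4 + 3 = 12) and the other rows get 0.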
Nothing in that expression produces `nan`; Spark does not generally return `nan` here. The rows where the comparison is false get `0`, not `nan`.
Upvotes: 1