lejcestlesang

Reputation: 75

Window function sum, multiplied by condition

I am reviewing some code and would like a bit more clarity.

Here is my PySpark Dataframe:

YEAR_A YEAR_B AMOUNT
2000 2001 5
2000 2000 4
2000 2001 3

I define a window:

window = Window.partitionBy('YEAR_A')

I would like help understanding the following line, especially the part after over(window):

df = df.withColumn("newcolumn", F.sum("AMOUNT").over(window) * (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer"))

Is it supposed to add a "newcolumn" to my dataframe containing the sum of "AMOUNT" for the current YEAR_A, written only if "YEAR_A" equals "YEAR_B" (and nan otherwise)? Or am I missing something?

Upvotes: 0

Views: 131

Answers (1)

ZygD

Reputation: 24386

(F.col("YEAR_B") == F.col("YEAR_A")) compares the two columns row by row. If the values in a row are equal, you get True; otherwise you get False.

.cast("integer") converts that boolean result to an integer. True becomes 1, False becomes 0.

F.sum("AMOUNT").over(window) * multiplies the result of the window function (the total AMOUNT of the row's YEAR_A partition) by the 1 or 0 from above. When you multiply by 1, you get the value of the window function. When you multiply by 0, you get 0.

Nothing in this expression produces nan: rows where the years differ get 0. Spark generally represents missing values as null rather than nan.
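
To trace the expression on the sample rows from the question, here is a minimal plain-Python sketch. It does not use Spark; it only mimics the per-row arithmetic described above, and the names rows, partition_sum and newcolumn are illustrative, not part of the original code.

```python
from collections import defaultdict

# Sample rows from the question: (YEAR_A, YEAR_B, AMOUNT)
rows = [(2000, 2001, 5), (2000, 2000, 4), (2000, 2001, 3)]

# Mimics F.sum("AMOUNT").over(Window.partitionBy("YEAR_A")):
# every row sees the total AMOUNT of its YEAR_A partition.
partition_sum = defaultdict(int)
for year_a, _, amount in rows:
    partition_sum[year_a] += amount

# Mimics (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer"):
# True -> 1, False -> 0, then multiply by the partition total.
newcolumn = [
    partition_sum[year_a] * int(year_b == year_a)
    for year_a, year_b, _ in rows
]

print(newcolumn)  # [0, 12, 0]
```

Only the second row has YEAR_A equal to YEAR_B, so it gets the partition total of 12 and the other two rows get 0. If a missing value is wanted instead of 0, one option would be F.when(F.col("YEAR_B") == F.col("YEAR_A"), F.sum("AMOUNT").over(window)) with no otherwise clause, which leaves the unmatched rows as null.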

Upvotes: 1
