cnns

Reputation: 171

PySpark subtract last row from first row in a group

I want to use a window function to partition by ID, subtract the first row of each group from the last row, and put the result in a separate column. What is the cleanest way to achieve this?

ID   col1     
1     1     
1     2     
1     4     
2     1     
2     1     
2     6 
3     5
3     5
3     7

Desired output:

ID   col1   col2  
1     1      3
1     2      3
1     4      3
2     1      5
2     1      5
2     6      5
3     5      2
3     5      2
3     7      2

Upvotes: 0

Views: 591

Answers (2)

wwnde

Reputation: 26676

Code below

    from pyspark.sql.window import Window
    from pyspark.sql.functions import first, last

    # frame spans the whole partition so last() sees the final row, not the current one
    w = Window.partitionBy('ID').orderBy('col1').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    df.withColumn('out', last('col1').over(w) - first('col1').over(w)).show()
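
For a self-contained test, the sample data from the question can be loaded as below (assuming an active SparkSession named spark); applying the withColumn call above then yields 3, 5 and 2 for IDs 1, 2 and 3, matching the desired col2.

    df = spark.createDataFrame(
        [(1, 1), (1, 2), (1, 4),
         (2, 1), (2, 1), (2, 6),
         (3, 5), (3, 5), (3, 7)],
        ['ID', 'col1'])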

Upvotes: 1

d125q

Reputation: 1666

Sounds like you’re defining the “first” row as the row with the minimum value of col1 in the group, and the “last” row as the row with the maximum value of col1 in the group. To compute them, you can use the MIN and MAX window functions:

SELECT
    ID,
    col1,
    (MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
FROM
    ...
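
In the DataFrame API, the same approach might look like the sketch below (assuming df holds the data from the question); MIN and MAX need neither an ORDER BY nor a frame clause, so the window is just a partition.

    from pyspark.sql.window import Window
    from pyspark.sql.functions import max as max_, min as min_

    # per-ID range: maximum of col1 minus minimum of col1
    w = Window.partitionBy('ID')
    df.withColumn('col2', max_('col1').over(w) - min_('col1').over(w)).show()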

If you’re defining “first” and “last” row somehow differently (e.g., in terms of some timestamp), you can use the more general FIRST_VALUE and LAST_VALUE window functions:

SELECT
    ID,
    col1,
    (LAST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
    -
    (FIRST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
    AS col2
FROM
    ...

The two snippets above are equivalent, but the latter is more general: you can specify ordering by a different column and/or you can modify the window specification.
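
If you prefer to keep the SQL as-is, either query can be run from PySpark by registering the DataFrame as a temporary view, roughly like this (the view name t and the DataFrame df are assumptions):

    df.createOrReplaceTempView('t')
    spark.sql("""
        SELECT ID, col1,
               (MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
        FROM t
    """).show()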

Upvotes: 1
