cnns

Reputation: 171

PySpark subtract last row from first row in a group

I want to use a window function to partition by ID, subtract the first row of each group from the last row, and put the result in a separate column. What is the cleanest way to achieve this?

ID   col1     
1     1     
1     2     
1     4     
2     1     
2     1     
2     6 
3     5
3     5
3     7

Desired output:

ID   col1   col2  
1     1      3
1     2      3
1     4      3
2     1      5
2     1      5
2     6      5
3     5      2
3     5      2
3     7      2

Upvotes: 0

Views: 591

Answers (2)

wwnde

Reputation: 26676

Code below

    from pyspark.sql.window import Window
    from pyspark.sql.functions import first, last

    # frame spans the whole partition so last() sees the final row, not the current one
    w = Window.partitionBy('ID').orderBy('col1').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    df.withColumn('out', last('col1').over(w) - first('col1').over(w)).show()
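
For a self-contained test, the sample data from the question can be loaded as below (assuming an active SparkSession named spark); applying the withColumn call above then yields 3, 5 and 2 for IDs 1, 2 and 3, matching the desired col2.

    df = spark.createDataFrame(
        [(1, 1), (1, 2), (1, 4),
         (2, 1), (2, 1), (2, 6),
         (3, 5), (3, 5), (3, 7)],
        ['ID', 'col1'])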

Upvotes: 1

d125q

Reputation: 1666

Sounds like you’re defining the “first” row as the row with the minimum value of col1 in the group, and the “last” row as the row with the maximum value of col1 in the group. To compute them, you can use the MIN and MAX window functions:

SELECT
    ID,
    col1,
    (MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
FROM
    ...
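
In the DataFrame API, the same approach might look like the sketch below (assuming df holds the data from the question); MIN and MAX need neither an ORDER BY nor a frame clause, so the window is just a partition.

    from pyspark.sql.window import Window
    from pyspark.sql.functions import max as max_, min as min_

    # per-ID range: maximum of col1 minus minimum of col1
    w = Window.partitionBy('ID')
    df.withColumn('col2', max_('col1').over(w) - min_('col1').over(w)).show()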

If you’re defining “first” and “last” row somehow differently (e.g., in terms of some timestamp), you can use the more general FIRST_VALUE and LAST_VALUE window functions:

SELECT
    ID,
    col1,
    (LAST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
    -
    (FIRST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
    AS col2
FROM
    ...

The two snippets above are equivalent, but the latter is more general: you can specify ordering by a different column and/or you can modify the window specification.
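
If you prefer to keep the SQL as-is, either query can be run from PySpark by registering the DataFrame as a temporary view, roughly like this (the view name t and the DataFrame df are assumptions):

    df.createOrReplaceTempView('t')
    spark.sql("""
        SELECT ID, col1,
               (MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
        FROM t
    """).show()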

Upvotes: 1
