Reputation: 171
I want to use window function to partition by ID and have the last row of each group to be subtracted from the first row and create a separate column with the output. What is the cleanest way to achieve that result?
ID col1
1 1
1 2
1 4
2 1
2 1
2 6
3 5
3 5
3 7
Desired output:
ID col1 col2
1 1 3
1 2 3
1 4 3
2 1 5
2 1 5
2 6 5
3 5 2
3 5 2
3 7 2
Upvotes: 0
Views: 591
Reputation: 26676
Code below
w=Window.partitionBy('ID').orderBy('col1').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('out', last('col1').over(w)-first('col1').over(w)).show()
Upvotes: 1
Reputation: 1666
Sounds like you’re defining the “first” row as the row with the minimum value of col1
in the group, and the “last” row as the row with maximum value of col1
in the group. To compute them, you can use the MIN
and MAX
window functions:
SELECT
ID,
col1,
(MAX(col1) OVER (PARTITION BY ID)) - (MIN(col1) OVER (PARTITION BY ID)) AS col2
FROM
...
If you’re defining “first” and “last” row somehow differently (e.g., in terms of some timestamp), you can use the more general FIRST_VALUE
and LAST_VALUE
window functions:
SELECT
ID,
col1,
(LAST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
-
(FIRST_VALUE(col1) OVER (PARTITION BY ID ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
AS col2
FROM
...
The two snippets above are equivalent, but the latter is more general: you can specify ordering by a different column and/or you can modify the window specification.
Upvotes: 1