Reputation: 41
I have a data frame in PySpark which looks like this:
| Id1 | id2  | row | grp |
|-----|------|-----|-----|
| 12  | 1234 | 1   | 1   |
| 23  | 1123 | 2   | 1   |
| 45  | 2343 | 3   | 2   |
| 65  | 2345 | 1   | 2   |
| 67  | 3456 | 2   | 2   |
I need to retrieve the value of `id2` corresponding to `row = 1` and update all `id2` values within each `grp` to that value.
This should be the final result:
| Id1 | id2  | row | grp |
|-----|------|-----|-----|
| 12  | 1234 | 1   | 1   |
| 23  | 1234 | 2   | 1   |
| 45  | 2345 | 3   | 2   |
| 65  | 2345 | 1   | 2   |
| 67  | 2345 | 2   | 2   |
I tried something like `df.groupby('grp').sort('row').first('id2')`, but apparently `sort`/`orderBy` cannot be chained after `groupby` in PySpark.
Any idea how to go about this?
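For anyone who wants to reproduce this, here is a minimal sketch that builds the sample DataFrame above (the data and column names come from the table; the types and the `spark` session setup are assumptions):

```python
from pyspark.sql import SparkSession

# Assumes a local SparkSession; adjust to your environment
spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question's table
df = spark.createDataFrame(
    [(12, 1234, 1, 1),
     (23, 1123, 2, 1),
     (45, 2343, 3, 2),
     (65, 2345, 1, 2),
     (67, 3456, 2, 2)],
    ['Id1', 'id2', 'row', 'grp'],
)
df.show()
```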
Upvotes: 2
Views: 6023
Reputation: 9247
Very similar to @Steven's answer, but without using `.rowsBetween`. You basically create a `Window` partitioned by `grp`, sort the rows by `row`, and pick the first `id2` in each group.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Window over each grp, ordered by row, so first() picks the id2 of the row with the smallest row
w = Window.partitionBy('grp').orderBy('row')

# Overwrite id2 with the first id2 in its group
df = df.withColumn('id2', F.first('id2').over(w))
df.show()
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234| 1| 1|
| 23|1234| 2| 1|
| 65|2345| 1| 2|
| 67|2345| 2| 2|
| 45|2345| 3| 2|
+---+----+---+---+
Upvotes: 4
Reputation: 15258
Try this:
from pyspark.sql import functions as F, Window as W

df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
Upvotes: 0