Reputation: 2473
I have a dataframe with three columns
ID | rank | score
1 | 1 | 20
1 | 2 | 30
2 | 4 | 50
2 | 6 | 100
I would like to select unique IDs, keeping the row with the highest rank per ID:
ID | rank | score
1 | 2 | 30
2 | 6 | 100
I tried this with a groupBy, which sort of works, but as soon as I include the score column, which differs in each row, I just get back the initial dataframe and the grouping based on rank is lost (obviously).
Here is what I tried
data_grouped = data.groupBy("ID", "score")
data_grouped.agg({"rank": "max"}).orderBy("ID").show()
I feel like this should be achievable relatively easily, but my SQL / PySpark knowledge is lacking.
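For anyone who wants to reproduce this, the sample frame can be built as below (assuming an active SparkSession named spark):
# Sample data matching the table above; `spark` is assumed to be an existing SparkSession
data = spark.createDataFrame(
    [(1, 1, 20), (1, 2, 30), (2, 4, 50), (2, 6, 100)],
    ["ID", "rank", "score"],
)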
Upvotes: 0
Views: 1666
Reputation: 8410
Using a window will be more efficient and scalable than using a groupBy. You can compute the maximum value of rank within each ID partition using the max function over a window partitioned by ID, and then filter on rank equal to the computed maxrank.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID")  # window over all rows sharing the same ID
df.withColumn("maxrank", F.max("rank").over(w)).filter("rank = maxrank").drop("maxrank").show()
+---+----+-----+
| ID|rank|score|
+---+----+-----+
| 1| 2| 30|
| 2| 6| 100|
+---+----+-----+
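For comparison, the groupBy route from the question can also be made to work by aggregating the max rank per ID alone and joining it back to recover score; a rough sketch (assuming the same df and column names), though the extra join typically costs another shuffle compared to the single window pass:
from pyspark.sql import functions as F

# Highest rank per ID, then join back to the original frame to recover the matching score
max_ranks = df.groupBy("ID").agg(F.max("rank").alias("rank"))
max_ranks.join(df, on=["ID", "rank"]).orderBy("ID").show()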
Upvotes: 1
Reputation: 950
You probably need to use the partitionBy() method. I believe it does something similar to SQL's PARTITION BY:
https://www.sqltutorial.org/sql-window-functions/sql-partition-by/
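If it helps to see the analogy, the same result can be written with PARTITION BY directly through Spark SQL; a rough sketch, assuming an active SparkSession named spark and that the DataFrame has been registered as a temporary view (the view name "data" below is just an example):
# Register the frame so it can be queried with Spark SQL
df.createOrReplaceTempView("data")

spark.sql("""
    SELECT ID, rank, score
    FROM (
        SELECT *, MAX(rank) OVER (PARTITION BY ID) AS maxrank
        FROM data
    ) t
    WHERE rank = maxrank
""").show()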
Upvotes: 1