Aesir

Reputation: 2473

select pyspark dataframe rows based on result of groupBy

I have a dataframe with three columns

ID | rank | score
1  | 1    | 20
1  | 2    | 30
2  | 4    | 50
2  | 6    | 100

I would like to select unique ids taking the row with the highest rank:

ID | rank | score
1  | 2    | 30
2  | 6    | 100

I tried this with a groupBy, which sort of works, but as soon as I include the score column (which differs in every row) I just get back the initial dataframe, and the grouping based on the rank is lost (obviously).

Here is what I tried

data_grouped = data.groupBy("ID", "score")
data_grouped.agg({"rank": "max"}).orderBy("ID").show()

I feel like this should be achievable relatively easily, but my SQL / PySpark knowledge is lacking.

Upvotes: 0

Views: 1666

Answers (2)

murtihash

Reputation: 8410

Using a window will be more efficient and scalable than using a groupBy. You can compute the maximum value of rank within each ID partition using the max function over a window partitioned by ID, and then keep only the rows where rank equals that computed maxrank.

from pyspark.sql import Window, functions as F

w = Window.partitionBy("ID")
df.withColumn("maxrank", F.max("rank").over(w)).filter("rank = maxrank").drop("maxrank").show()

+---+----+-----+
| ID|rank|score|
+---+----+-----+
|  1|   2|   30|
|  2|   6|  100|
+---+----+-----+
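For comparison, the groupBy route from the question can also be made to work by aggregating the max rank per ID and joining the result back to recover the score. This is only a minimal sketch assuming the same df as above; it costs an extra join and shuffle, which is why the window version tends to be preferred:

from pyspark.sql import functions as F

# Aggregate the highest rank per ID, then join back to pick up the score column.
max_ranks = df.groupBy("ID").agg(F.max("rank").alias("rank"))
max_ranks.join(df, on=["ID", "rank"], how="inner").orderBy("ID").show()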

Upvotes: 1

Jordan

Reputation: 950

You probably need to use the partitionBy() method:

https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=partition#pyspark.RDD.partitionBy

I believe it does something similar to SQL's PARTITION BY:

https://www.sqltutorial.org/sql-window-functions/sql-partition-by/
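In Spark that PARTITION BY shows up as a window specification. A minimal sketch of the equivalent Spark SQL, assuming a SparkSession bound to spark and the asker's dataframe registered under a hypothetical view name data, could look like:

# Register the dataframe under a hypothetical view name so it can be queried with Spark SQL.
df.createOrReplaceTempView("data")

# MAX(...) OVER (PARTITION BY ID) is the SQL window equivalent of Window.partitionBy("ID");
# backticks around the rank column avoid clashing with the RANK keyword.
spark.sql("""
    SELECT ID, `rank`, score
    FROM (
        SELECT *, MAX(`rank`) OVER (PARTITION BY ID) AS maxrank
        FROM data
    ) t
    WHERE `rank` = maxrank
""").show()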

Upvotes: 1
