Aesir

Reputation: 2473

select pyspark dataframe rows based on result of groupBy

I have a dataframe with three columns

ID | rank | score
1  | 1    | 20
1  | 2    | 30
2  | 4    | 50
2  | 6    | 100

I would like to select unique ids taking the row with the highest rank:

ID | rank | score
1  | 2    | 30
2  | 6    | 100

I tried this with a groupBy, which sort of works, but as soon as I include the score column (which differs in every row) I just get back the initial dataframe, and the grouping based on the rank is lost (obviously).

Here is what I tried

data_grouped = data.groupBy("ID", "score")
data_grouped.agg({"rank": "max"}).orderBy("ID").show()

I feel like this should be achievable relatively easily, but my SQL / PySpark knowledge is lacking.

Upvotes: 0

Views: 1666

Answers (2)

murtihash

Reputation: 8410

Using a window will be more efficient and scalable than using a groupBy. You can compute the maximum value of rank within each ID partition using the max function over a window partitioned by ID, and then keep only the rows where rank equals that computed maxrank.

from pyspark.sql import Window, functions as F

w = Window.partitionBy("ID")
df.withColumn("maxrank", F.max("rank").over(w)).filter("rank = maxrank").drop("maxrank").show()

+---+----+-----+
| ID|rank|score|
+---+----+-----+
|  1|   2|   30|
|  2|   6|  100|
+---+----+-----+
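For comparison, the groupBy route from the question can also be made to work by aggregating the max rank per ID and joining the result back to recover the score. This is only a minimal sketch assuming the same df as above; it costs an extra join and shuffle, which is why the window version tends to be preferred:

from pyspark.sql import functions as F

# Aggregate the highest rank per ID, then join back to pick up the score column.
max_ranks = df.groupBy("ID").agg(F.max("rank").alias("rank"))
max_ranks.join(df, on=["ID", "rank"], how="inner").orderBy("ID").show()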

Upvotes: 1

Jordan

Reputation: 950

You probably need to use the partitionBy() method:

https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=partition#pyspark.RDD.partitionBy

I believe it does something similar to SQL's PARTITION BY:

https://www.sqltutorial.org/sql-window-functions/sql-partition-by/
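In Spark that PARTITION BY shows up as a window specification. A minimal sketch of the equivalent Spark SQL, assuming a SparkSession bound to spark and the asker's dataframe registered under a hypothetical view name data, could look like:

# Register the dataframe under a hypothetical view name so it can be queried with Spark SQL.
df.createOrReplaceTempView("data")

# MAX(...) OVER (PARTITION BY ID) is the SQL window equivalent of Window.partitionBy("ID");
# backticks around the rank column avoid clashing with the RANK keyword.
spark.sql("""
    SELECT ID, `rank`, score
    FROM (
        SELECT *, MAX(`rank`) OVER (PARTITION BY ID) AS maxrank
        FROM data
    ) t
    WHERE `rank` = maxrank
""").show()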

Upvotes: 1
