koh-ding

Reputation: 135

How to select the N highest values for each category in Spark Scala

Say I have this dataset:

  val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
    ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")

which looks like this:

+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+

I want to pivot on the teams column and, for each of the other columns, return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return this:

[expected result: the two highest homeruns values for yankees-mets, i.e. 8 and 6]

Since the 2 highest homerun totals for them were 8 and 6.

How would I do this in the general case?

Thanks

Upvotes: 0

Views: 326

Answers (1)

Krzysztof Atłasik

Reputation: 22595

Your problem is not really a good fit for pivot, since a pivot means:

A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
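
For contrast, here is a minimal sketch (not part of the original answer) of what a pivot over this data would actually produce: one aggregated value per distinct team, transposed into columns, rather than the top-N rows you are after. The max aggregation is just an illustrative choice.

import org.apache.spark.sql.functions.max

// Illustrative only: pivot turns the distinct values of "teams" into columns
// and keeps a single aggregate per column (here, the highest homeruns value).
main_df.groupBy().pivot("teams").agg(max("homeruns")).show
// yields a single row with columns yankees-mets (8) and yankees-redsox (5)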

You could create an additional rank column with a window function and then select only rows with rank 1 or 2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// rank rows within each team by homeruns, highest first
main_df.withColumn(
  "rank",
  rank()
    .over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
)
.where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
.show
+------------+--------+----+----+
|       teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets|       8|  20|   1|
|yankees-mets|       6|  17|   2|
+------------+--------+----+----+

Then, if you no longer need the rank column, you can simply drop it with .drop("rank").
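
For the general case (any ordering column, any N, all teams at once), a small helper along these lines should work; the name topNPerTeam and its parameters are only illustrative, and note that rank keeps ties, so row_number could be swapped in if you want exactly N rows per team.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Illustrative helper: keep the N highest rows per team for a given column,
// then drop the temporary rank column.
def topNPerTeam(df: DataFrame, orderCol: String, n: Int): DataFrame =
  df.withColumn("rank", rank().over(Window.partitionBy("teams").orderBy(col(orderCol).desc)))
    .where(col("rank") <= n)
    .drop("rank")

// For example, the two highest homeruns rows for every team:
// topNPerTeam(main_df, "homeruns", 2).show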

Upvotes: 4
