koh-ding

Reputation: 135

How to select the N highest values for each category in Spark Scala

Say I have this dataset:

  val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
    ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")

which looks like this:

+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+

I want to pivot on the teams column and, for each of the other columns, return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return this:

[expected result: the two highest homeruns values for yankees-mets, i.e. 8 and 6]

Since the 2 highest homerun totals for them were 8 and 6.

How would I do this in the general case?

Thanks

Upvotes: 0

Views: 326

Answers (1)

Krzysztof Atłasik

Reputation: 22595

Your problem is not really a good fit for pivot, since a pivot means:

A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
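
For contrast, here is a minimal sketch (not part of the original answer) of what a pivot over this data would actually produce: one aggregated value per distinct team, transposed into columns, rather than the top-N rows you are after. The max aggregation is just an illustrative choice.

import org.apache.spark.sql.functions.max

// Illustrative only: pivot turns the distinct values of "teams" into columns
// and keeps a single aggregate per column (here, the highest homeruns value).
main_df.groupBy().pivot("teams").agg(max("homeruns")).show
// yields a single row with columns yankees-mets (8) and yankees-redsox (5)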

You could create an additional rank column with a window function and then select only rows with rank 1 or 2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// rank rows within each team by homeruns, highest first
main_df.withColumn(
  "rank",
  rank()
    .over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
)
.where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
.show
+------------+--------+----+----+
|       teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets|       8|  20|   1|
|yankees-mets|       6|  17|   2|
+------------+--------+----+----+

Then, if you no longer need the rank column, you can simply drop it with .drop("rank").
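
For the general case (any ordering column, any N, all teams at once), a small helper along these lines should work; the name topNPerTeam and its parameters are only illustrative, and note that rank keeps ties, so row_number could be swapped in if you want exactly N rows per team.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Illustrative helper: keep the N highest rows per team for a given column,
// then drop the temporary rank column.
def topNPerTeam(df: DataFrame, orderCol: String, n: Int): DataFrame =
  df.withColumn("rank", rank().over(Window.partitionBy("teams").orderBy(col(orderCol).desc)))
    .where(col("rank") <= n)
    .drop("rank")

// For example, the two highest homeruns rows for every team:
// topNPerTeam(main_df, "homeruns", 2).show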

Upvotes: 4
