koh-ding

Reputation: 135

How to do a window partition and extract only unique values per group in Spark Scala

Hi how's it going? I'm having an issue with excluding ties using window partition in scala. I want to collect the top 2 for each group excluding ties. So if there are three values, [5,5,3] for a particular group I want it to return 5 and 3, not 5 and 5.

This is my code so far:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.row_number
  import spark.implicits._ // for the $"col" syntax; already in scope in spark-shell

  val dummy_df = Seq(("yankees",5,4),("yankees",3,7),("yankees",5,2),("yankees",3,4),("red sox",7,3),("red sox",2,5),("red sox",2,4)).toDF("team","runs","opponent_runs")
  val team_runs_partition = Window.partitionBy("team").orderBy($"runs".desc)
  dummy_df.withColumn("rn", row_number.over(team_runs_partition)).filter($"rn" <= 2).show

In tabular view the original dataframe looks like this:

  +-------+----+-------------+
  |   team|runs|opponent_runs|
  +-------+----+-------------+
  |yankees|   5|            4|
  |yankees|   3|            7|
  |yankees|   5|            2|
  |yankees|   3|            4|
  |red sox|   7|            3|
  |red sox|   2|            5|
  |red sox|   2|            4|
  +-------+----+-------------+

and my current output dataframe looks like this:

(the filtered output, rn 1 and 2 per team: both yankees rows have runs = 5, while the red sox rows have runs 7 and 2)

However, as you can see, due to the tie it's not returning the top 2 unique "runs" values for the Yankees, but rather two instances of 5. I tried using the rank function, but that's even worse with ties: if I have 5 ties at the top, the next value's rank falls all the way to 6.

How would I solve this?

Thanks so much and have a great day!

Upvotes: 0

Views: 488

Answers (1)

Nir Hedvat

Reputation: 870

If you want to keep the duplicated rows visible while ranking (in case you have additional columns you're interested in), use dense_rank() instead of row_number() and then drop the duplicates:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, row_number}

val window = Window.partitionBy("team").orderBy(col("runs").desc)
data
  .withColumn("rn", dense_rank() over window) // ties share a rank, so rn <= 2 keeps the top 2 distinct run values
  .filter(col("rn") <= 2)
  ... // TODO whatever ...
  .dropDuplicates("team", "runs")
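Put together with the dummy data from the question, the first approach looks like this (a sketch, assuming a spark-shell or local SparkSession named `spark`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val dummy_df = Seq(
  ("yankees", 5, 4), ("yankees", 3, 7), ("yankees", 5, 2), ("yankees", 3, 4),
  ("red sox", 7, 3), ("red sox", 2, 5), ("red sox", 2, 4)
).toDF("team", "runs", "opponent_runs")

val window = Window.partitionBy("team").orderBy(col("runs").desc)

// dense_rank gives tied rows the same rank, so filtering rn <= 2 keeps the
// top 2 *distinct* run values; dropDuplicates then keeps one row per value
dummy_df
  .withColumn("rn", dense_rank() over window)
  .filter(col("rn") <= 2)
  .dropDuplicates("team", "runs")
  .show()
// keeps runs 5 and 3 for the yankees, and 7 and 2 for the red sox
```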

If you don't care about the duplicated rows (in 'team' and 'runs'), drop the duplicates before applying your window:

data
  .dropDuplicates("team", "runs")
  .withColumn("rn", row_number() over window)
  .filter(col("rn") <= 2)
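The dedup-first variant on the same data looks like this (again a sketch, assuming a SparkSession named `spark` is in scope). Since the duplicate (team, runs) pairs are removed up front, row_number can no longer pick the same run value twice:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

import spark.implicits._ // assumes an active SparkSession named `spark`

val data = Seq(
  ("yankees", 5, 4), ("yankees", 3, 7), ("yankees", 5, 2), ("yankees", 3, 4),
  ("red sox", 7, 3), ("red sox", 2, 5), ("red sox", 2, 4)
).toDF("team", "runs", "opponent_runs")

val window = Window.partitionBy("team").orderBy(col("runs").desc)

// with duplicate (team, runs) pairs gone, rn 1 and 2 are guaranteed
// to be two distinct run values per team
data
  .dropDuplicates("team", "runs")
  .withColumn("rn", row_number() over window)
  .filter(col("rn") <= 2)
  .show()
```

Note that dropDuplicates keeps an arbitrary row for each (team, runs) pair, so this variant is only appropriate when the other columns (here opponent_runs) don't matter.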

Upvotes: 1
