Reputation: 135
Hi how's it going? I'm having an issue with excluding ties using window partition in scala. I want to collect the top 2 for each group excluding ties. So if there are three values, [5,5,3] for a particular group I want it to return 5 and 3, not 5 and 5.
This is my code so far:
val dummy_df = Seq(("yankees",5,4),("yankees",3,7),("yankees",5,2),("yankees",3,4),("red sox",7,3),("red sox",2,5),("red sox",2,4)).toDF("team","runs","opponent_runs")
val team_runs_partition= Window.partitionBy("team").orderBy($"runs".desc)
dummy_df.withColumn("rn",row_number.over(team_runs_partition)).filter($"rn"<=2).show
In tabular view the original dataframe looks like this:
and my current output dataframe looks like this:
However as you can see, due to the tie, it's not returning the top 2 unique "runs" for the Yankees but rather two instances of 5. I tried using the rank function but that becomes even worse with ties because if I have 5 ties at the top, the second value falls all the way to 6.
How would I solve this?
Thanks so much and have a great day!
Upvotes: 0
Views: 488
Reputation: 870
If you want to see the duplications (in case you have additional columns you're interested in), use dense_rank() instead of row_number() and then drop duplications.
val window = Window.partitionBy("team").orderBy(col("runs").desc)
data
.withColumn("rn", dense_rank() over window)
.filter(col("rn") <= 2)
... // TODO whatever ...
.dropDuplicates("team", "runs")
If you don't care about duplications (in 'team' and 'runs') drop duplicates before you use your window:
data
.dropDuplicates("team", "runs")
.withColumn("rn", row_number() over window)
.filter(col("rn") <= 2)
Upvotes: 1