mike_di
mike_di

Reputation: 3

Select the value with most occurrence with groupBy

Example Input:

Artist Skill
1. Bono Vocals
2. Bono Vocals
3. Bono Vocals
4. Bono Guitar
5. Edge Vocals
6. Edge Guitar
7. Edge Guitar
8. Edge Guitar
9. Edge     Bass
10. Larry   Drum
11. Larry   Drum
12. Larry   Guitar
13. Clayton Bass
14. Clayton Bass
15. Clayton Guitar

Corresponding Output

Artist Most Common skills

  1. Bono Vocals Edge Guitar Larry Drum Clayton Bass

I have a dataframe and i want to create a deterministic code using scala to generate a new DataFrame with exactly one row for each distinct "Artist" and the most common "Skills" for the corresponding Artist.

Upvotes: 0

Views: 101

Answers (1)

koiralo
koiralo

Reputation: 23109

You can combine groupBy and window functions as below

val window = Window.partitionBy("Artist").orderBy($"count".desc)
df.groupBy("Artist", "Skill")
  .agg(count("Skill").as("count")). // gives you count of artist and skill
  //select the first row with adding rownumber 
  .withColumn("rn", row_number over window).where($"rn" === 1 ) 
  .drop("rn", "count")
  .show(false)

Output:

+-------+------+
|Artist |Skill |
+-------+------+
|Clayton|Bass  |
|Larry  |Drum  |
|Edge   |Guitar|
|Bono   |Vocals|
+-------+------+

Upvotes: 1

Related Questions