Select the value with most occurrence with groupBy

Question

Example Input:

Artist Skill
1. Bono Vocals
2. Bono Vocals
3. Bono Vocals
4. Bono Guitar
5. Edge Vocals
6. Edge Guitar
7. Edge Guitar
8. Edge Guitar
9. Edge     Bass
10. Larry   Drum
11. Larry   Drum
12. Larry   Guitar
13. Clayton Bass
14. Clayton Bass
15. Clayton Guitar

Corresponding Output

Artist Most Common skills

  1. Bono Vocals Edge Guitar Larry Drum Clayton Bass

I have a dataframe and i want to create a deterministic code using scala to generate a new DataFrame with exactly one row for each distinct "Artist" and the most common "Skills" for the corresponding Artist.

koiralo · Accepted Answer

You can combine groupBy and window functions as below

val window = Window.partitionBy("Artist").orderBy($"count".desc)
df.groupBy("Artist", "Skill")
  .agg(count("Skill").as("count")). // gives you count of artist and skill
  //select the first row with adding rownumber 
  .withColumn("rn", row_number over window).where($"rn" === 1 ) 
  .drop("rn", "count")
  .show(false)

Output:

+-------+------+
|Artist |Skill |
+-------+------+
|Clayton|Bass  |
|Larry  |Drum  |
|Edge   |Guitar|
|Bono   |Vocals|
+-------+------+

Select the value with most occurrence with groupBy

Answers (1)

Related Questions