Reputation: 3
Example Input:
Artist Skill
1. Bono Vocals
2. Bono Vocals
3. Bono Vocals
4. Bono Guitar
5. Edge Vocals
6. Edge Guitar
7. Edge Guitar
8. Edge Guitar
9. Edge Bass
10. Larry Drum
11. Larry Drum
12. Larry Guitar
13. Clayton Bass
14. Clayton Bass
15. Clayton Guitar
Corresponding Output
Artist Most Common skills
1. Bono Vocals Edge Guitar Larry Drum Clayton Bass
I have a dataframe and i want to create a deterministic code using scala to generate a new DataFrame with exactly one row for each distinct "Artist" and the most common "Skills" for the corresponding Artist.
Upvotes: 0
Views: 101
Reputation: 23109
You can combine groupBy
and window
functions as below
val window = Window.partitionBy("Artist").orderBy($"count".desc)
df.groupBy("Artist", "Skill")
.agg(count("Skill").as("count")). // gives you count of artist and skill
//select the first row with adding rownumber
.withColumn("rn", row_number over window).where($"rn" === 1 )
.drop("rn", "count")
.show(false)
Output:
+-------+------+
|Artist |Skill |
+-------+------+
|Clayton|Bass |
|Larry |Drum |
|Edge |Guitar|
|Bono |Vocals|
+-------+------+
Upvotes: 1