Reputation: 23
I have this df:
| User | country |
| Ron  | italy   |
| Tom  | japan   |
| Lin  | spain   |
| Tom  | china   |
| Tom  | china   |
| Lin  | japan   |
| Tom  | china   |
| Lin  | japan   |
For each user, I want to count how many times each country appears. For example, for the df above I'd expect:
[Ron -> [italy -> 1], Tom -> [japan -> 1, china -> 3], Lin -> [spain -> 1, japan -> 2]]
I started with:
val groupedbyDf = df.groupBy("User")
but I don't know how to continue. Should I use agg()?
Upvotes: 1
Views: 56
Reputation: 42352
Count the rows per (User, country) pair first, then group by User again and build the maps with the relevant map functions:
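import org.apache.spark.sql.functions._

// Count rows per (User, country), then fold each user's counts into a map.
val df2 = df.groupBy("User", "country")
  .count()
  .groupBy("User")
  .agg(map(
      col("User"),
      map_from_entries(collect_list(struct(col("country"), col("count"))))
    ).as("result")
  )
  .select("result")

df2.show(false)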
+---------------------------------+
|result                           |
+---------------------------------+
|[Tom -> [china -> 3, japan -> 1]]|
|[Lin -> [spain -> 1, japan -> 2]]|
|[Ron -> [italy -> 1]]            |
+---------------------------------+
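To see why two groupBy steps are involved, the same two-stage aggregation can be sketched with plain Scala collections, without Spark. This is only an illustration of the logic: the names rows, pairCounts and countsByUser are made up for this sketch, and the data is the sample from the question.

```scala
// Sample rows from the question as (user, country) pairs.
val rows = Seq(
  "Ron" -> "italy", "Tom" -> "japan", "Lin" -> "spain", "Tom" -> "china",
  "Tom" -> "china", "Lin" -> "japan", "Tom" -> "china", "Lin" -> "japan"
)

// Stage 1: count each (user, country) pair,
// analogous to groupBy("User", "country").count().
val pairCounts: Map[(String, String), Int] =
  rows.groupBy(identity).map { case (pair, hits) => pair -> hits.size }

// Stage 2: regroup by user and turn that user's pairs into a country -> count map,
// analogous to groupBy("User") followed by map_from_entries(collect_list(...)).
val countsByUser: Map[String, Map[String, Int]] =
  pairCounts
    .groupBy { case ((user, _), _) => user }
    .map { case (user, entries) =>
      user -> entries.map { case ((_, country), n) => country -> n }
    }

println(countsByUser)
```

The nested map it builds has the same shape as the result column above, just held in local memory instead of a DataFrame.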
If you want all of them in one row, you can do one more aggregation:
import org.apache.spark.sql.functions._

// Same per-user maps as before, then one global aggregation into a single row.
val df3 = df.groupBy("User", "country")
  .count()
  .groupBy("User")
  .agg(map_from_entries(collect_list(struct(col("country"), col("count")))).as("result"))
  .agg(map_from_entries(collect_list(struct(col("User"), col("result")))).as("result_all"))

df3.show(false)
+---------------------------------------------------------------------------------------+
|result_all                                                                             |
+---------------------------------------------------------------------------------------+
|[Tom -> [china -> 3, japan -> 1], Lin -> [spain -> 1, japan -> 2], Ron -> [italy -> 1]]|
+---------------------------------------------------------------------------------------+
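If the goal is an ordinary Scala Map on the driver rather than a DataFrame, one possible approach is to collect the per-user rows and convert them. The following is a minimal sketch, assuming the result is small enough to collect and that a local session is acceptable for trying it out; the session settings and the names perUser and asMap are assumptions for this example, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, map_from_entries, struct}

// Assumption: a local session for experimenting; in spark-shell,
// getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("country-counts")
  .getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq(
  ("Ron", "italy"), ("Tom", "japan"), ("Lin", "spain"), ("Tom", "china"),
  ("Tom", "china"), ("Lin", "japan"), ("Tom", "china"), ("Lin", "japan")
).toDF("User", "country")

// One row per user: (User, map of country -> count).
val perUser = df.groupBy("User", "country")
  .count()
  .groupBy("User")
  .agg(map_from_entries(collect_list(struct(col("country"), col("count")))).as("result"))

// Bring the rows to the driver and convert them to nested immutable Maps.
// count() produces a Long column, hence Map[String, Long].
val asMap: Map[String, Map[String, Long]] = perUser.collect()
  .map(row => row.getString(0) -> row.getMap[String, Long](1).toMap)
  .toMap

println(asMap)
```

Collecting is only reasonable for small results; for large data it is usually better to keep the result as a DataFrame, as in the examples above.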
Upvotes: 1