Reputation: 10950
If I have a dataset like this:
name | food | drink | dollars
==================================
John | salad | water | 1
Dave | salad | soda | 2
John | burger | water | 5
John | burger | soda | 1
How do I get the following result in Spark (Scala):
name | food_count | drink_count | total_dollars
==========================================================================
John | [(salad, 1), (burger, 2)] | [(water, 2), (soda, 1)] | 7
Dave | [(salad, 1)] | [(soda, 1)] | 2
Not sure which aggregate function to apply after groupBy("name")
.
Do I need to write a UADF?
Feels like it's a common enough problem, so I was hoping there is a solution using built-in functions.
Upvotes: 0
Views: 380
Reputation: 1892
In spark you can try with Multi-Dimensional Aggregation cube
count and then use collect_list
:
scala> var df =Seq(("John" , "salad" , "water", 1),("Dave" , "salad" , "soda" , 2),("John" , "burger" , "water", 5),("John" , "burger" , "soda" , 1)).toDF("name","food","drink","dollar")
scala> var testing = df.cube("name", "food","drink").count()
scala> var drinks_df =testing.filter(col("food").isNotNull).groupBy("name","drink").agg(struct(col("drink"),sum("count")).as("drink_count")).na.drop.groupBy("name").agg(collect_list("drink_count").as("drink_count")).join(df.groupBy("name").agg(sum("dollar").as("dollars_sum")),Seq("name"),"left").withColumnRenamed("name","name1")
scala> var fooddf1 = testing.filter(col("drink").isNotNull).groupBy("name","food").agg(struct(col("food"),sum("count")).as("food_count")).na.drop.groupBy("name").agg(collect_list("food_count").as("food_count"))
scala> fooddf1.join(drinks_df,col("name1")===col("name"),"left").drop("name1").show(false)
+----+-------------------------+-----------------------+-----------+
|name|food_count |drink_count |dollars_sum|
+----+-------------------------+-----------------------+-----------+
|Dave|[[salad, 1]] |[[soda, 1]] |2 |
|John|[[salad, 1], [burger, 2]]|[[water, 2], [soda, 1]]|7 |
+----+-------------------------+-----------------------+-----------+
With the help of multi-dimension function, you can count a row with a different group
Upvotes: 2