Leonti
Leonti

Reputation: 10950

Spark - how to get distinct values with their count

If I have a dataset like this:

name | food    | drink  | dollars
==================================
John | salad   | water  | 1
Dave | salad   | soda   | 2
John | burger  | water  | 5
John | burger  | soda   | 1

How do I get the following result in Spark (Scala):

name | food_count                | drink_count             | total_dollars
==========================================================================
John | [(salad, 1), (burger, 2)] | [(water, 2), (soda, 1)] | 7
Dave | [(salad, 1)]              | [(soda, 1)]             | 2

Not sure which aggregate function to apply after groupBy("name").
Do I need to write a UADF?
Feels like it's a common enough problem, so I was hoping there is a solution using built-in functions.

Upvotes: 0

Views: 380

Answers (1)

Mahesh Gupta
Mahesh Gupta

Reputation: 1892

In spark you can try with Multi-Dimensional Aggregation cube count and then use collect_list:

scala>  var df =Seq(("John" , "salad"   , "water", 1),("Dave" , "salad"   , "soda" , 2),("John" , "burger"  , "water", 5),("John" , "burger"  , "soda" , 1)).toDF("name","food","drink","dollar")

scala> var testing  = df.cube("name", "food","drink").count()
scala> var drinks_df =testing.filter(col("food").isNotNull).groupBy("name","drink").agg(struct(col("drink"),sum("count")).as("drink_count")).na.drop.groupBy("name").agg(collect_list("drink_count").as("drink_count")).join(df.groupBy("name").agg(sum("dollar").as("dollars_sum")),Seq("name"),"left").withColumnRenamed("name","name1")

scala> var fooddf1 = testing.filter(col("drink").isNotNull).groupBy("name","food").agg(struct(col("food"),sum("count")).as("food_count")).na.drop.groupBy("name").agg(collect_list("food_count").as("food_count"))

scala> fooddf1.join(drinks_df,col("name1")===col("name"),"left").drop("name1").show(false)
+----+-------------------------+-----------------------+-----------+            
|name|food_count               |drink_count            |dollars_sum|
+----+-------------------------+-----------------------+-----------+
|Dave|[[salad, 1]]             |[[soda, 1]]            |2          |
|John|[[salad, 1], [burger, 2]]|[[water, 2], [soda, 1]]|7          |
+----+-------------------------+-----------------------+-----------+

With the help of multi-dimension function, you can count a row with a different group

Upvotes: 2

Related Questions