LifeLongStudent

Reputation: 2478

Spark groupby aggregations

I am trying to do group-by aggregations using Spark 1.5.2.

Can you please tell me why this is not working?

Here, in is a DataFrame:

scala> in
res28: org.apache.spark.sql.DataFrame = [id: int, city: string]

scala> in.show
+---+--------+
| id|    city|
+---+--------+
| 10|Bathinda|
| 20|Amritsar|
| 30|Bathinda|
+---+--------+

scala> in.groupBy("city").agg(Map{
 |       "id" -> "sum"
 |     }).show(true)
+----+-------+
|city|sum(id)|
+----+-------+
+----+-------+

Thanks,

I expect the output to contain each city and the sum of its ids.

EDIT: I do not know why, but it worked the next time, when I created a new spark-shell.

Upvotes: 0

Views: 1058

Answers (1)

eliasah

Reputation: 40380

Consider the following DataFrame:

val in = sc.parallelize(Seq(
  (10, "Bathinda"), (20, "Amritsar"), (30, "Bathinda"))).toDF("id", "city")

You can see that all of the following code lines give the same output:

scala> in.groupBy("city").agg(Map("id" -> "sum")).show
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(sum($"id")).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+

scala> in.groupBy("city").agg(sum(in("id"))).show(true)
+--------+-------+
|    city|sum(id)|
+--------+-------+
|Bathinda|     40|
|Amritsar|     20|
+--------+-------+
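
If you also want to control the name of the aggregated column, Column.alias works here; a minimal sketch (the total_id name is just an example):

scala> in.groupBy("city").agg(sum($"id").alias("total_id")).show
// same rows as above, but the column is named total_id instead of sum(id)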

Note: The boolean argument to show is the truncate flag, and it is true by default. It only controls whether long field values are truncated for display or shown in full. (Sometimes a field is too long and you just need a preview.)
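
To illustrate, a small sketch (the long city name is made up):

scala> val wide = Seq((1, "A city name that is well over twenty characters")).toDF("id", "city")

scala> wide.show(true)  // default: values longer than 20 characters are truncated to a preview
scala> wide.show(false) // prints the full value, however wide the column gets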

Upvotes: 2
