user7652554

Reputation:

Sum of a single column across rows based on a condition in a Spark DataFrame

Consider the following dataframe:

+-------+-----------+-------+
|    rid|  createdon|  count|
+-------+-----------+-------+
|    124| 2017-06-15|     1 |
|    123| 2017-06-14|     2 |
|    123| 2017-06-14|     1 |
+-------+-----------+-------+
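For reproducibility, this dataframe can be built as follows (a minimal sketch, assuming a SparkSession named spark as in spark-shell, with plain integer and string column types):

import spark.implicits._

val df = Seq(
  (124, "2017-06-15", 1),
  (123, "2017-06-14", 2),
  (123, "2017-06-14", 1)
).toDF("rid", "createdon", "count")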

I need to sum the count column across rows that have the same createdon and rid.

Therefore the resulting dataframe should be as follows:

+-------+-----------+-------+
|    rid|  createdon|  count|
+-------+-----------+-------+
|    124| 2017-06-15|     1 |
|    123| 2017-06-14|     3 |
+-------+-----------+-------+

I am using Spark 2.0.2.

I have tried agg, conditions inside select, etc., but couldn't find a solution. Can anyone help me?

Upvotes: 0

Views: 1512

Answers (2)

Raphael Roth

Reputation: 27373

This should do what you want:

import org.apache.spark.sql.functions.sum
import spark.implicits._ // for the $"colName" column syntax

df
  .groupBy($"rid", $"createdon")   // one group per distinct (rid, createdon) pair
  .agg(sum($"count").as("count"))  // sum count within each group
  .show()
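With the sample data from the question this prints the two expected rows (row order may vary):

+---+----------+-----+
|rid| createdon|count|
+---+----------+-----+
|124|2017-06-15|    1|
|123|2017-06-14|    3|
+---+----------+-----+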

Upvotes: 0

Mikel San Vicente

Reputation: 3863

Try this:

import org.apache.spark.sql.{functions => func} // qualified import keeps func.sum unambiguous
import spark.implicits._ // for the $"colName" column syntax

df.groupBy($"rid", $"createdon").agg(func.sum($"count").alias("count"))
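Note that alias and as are interchangeable on a Column, so this is the same groupBy/agg approach as the answer above; the qualified functions import just keeps sum from clashing with other names in scope.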

Upvotes: 1
