Reputation: 323
I am new to Spark and have a use case: find the sum of all the values in a column, where each value in the column is an array of integers.
df.show(2,false)
+---------+
|value    |
+---------+
|[3, 4, 5]|
|[1, 2]   |
+---------+
The value to find is 3 + 4 + 5 + 1 + 2 = 15.
Can someone please help/guide me on how to achieve this?
Edit: I have to run this code in Spark 2.3.
Upvotes: 1
Views: 2762
Reputation: 49260
One option is to sum up the array on each row and then compute the overall sum. This can be done with the Spark SQL higher-order function aggregate, which is available from Spark version 2.4.0 onward, so it won't work on Spark 2.3; in that case use the explode option further down.
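The snippets below assume a sample DataFrame built as follows; the column names id and val are taken from this answer's show() output rather than the question, which used value. A minimal sketch to reproduce it:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data matching the question: each row holds an array of integers.
val df = Seq((1, Seq(3, 4, 5)), (2, Seq(1, 2))).toDF("id", "val")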
import org.apache.spark.sql.functions.expr
// Sum each row's array into a new column.
val tmp = df.withColumn("summed_val", expr("aggregate(val, 0, (acc, x) -> acc + x)"))
tmp.show()
+---+---------+----------+
| id| val|summed_val|
+---+---------+----------+
| 1|[3, 4, 5]| 12|
| 2| [1, 2]| 3|
+---+---------+----------+
import org.apache.spark.sql.functions.sum
// One-row DataFrame with the overall sum; collecting to a scalar value is possible too.
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
| 15|
+-----+
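To collect that total into a Scala value instead of showing it, a minimal sketch (sum over an integer column yields a LongType, and this assumes the DataFrame is non-empty):
// Collect the single aggregated row and pull out the total as a Long.
val total: Long = tmp.agg(sum("summed_val")).first().getLong(0)
// total == 15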
Another option is to use explode, which is available on Spark 2.3. But beware: this approach generates one row per array element, which can be a large amount of data to aggregate.
import org.apache.spark.sql.functions.{explode, sum}
import spark.implicits._ // for the $ column syntax

// One row per array element, then a plain sum over the exploded column.
val tmp = df.withColumn("elem", explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()
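The intermediate column can also be skipped; a minimal sketch of the same idea as one expression, with the expected output for the sample data shown as comments:
import org.apache.spark.sql.functions.{explode, sum}

df.select(explode($"val").alias("elem"))
  .agg(sum($"elem").alias("total"))
  .show()
// +-----+
// |total|
// +-----+
// |   15|
// +-----+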
Upvotes: 3