vamsi

Reputation: 323

Sum of all elements in an array column

I am new to Spark and have a use case: find the sum of all the values in a column, where each value is an array of integers.

df.show(2,false)

+------------------+
|value             |
+------------------+
|[3,4,5]           |
|[1,2]             |
+------------------+

The value to find: 3 + 4 + 5 + 1 + 2 = 15

Can someone please help/guide me on how to achieve this?

Edit: I have to run this code on Spark 2.3.

Upvotes: 1

Views: 2762

Answers (1)

Vamsi Prabhala

Reputation: 49260

One option is to sum the array on each row and then add up the per-row sums. This can be done with the Spark SQL higher-order function aggregate, available from Spark version 2.4.0 (so it will not work on Spark 2.3).

import org.apache.spark.sql.functions.{expr, sum}

val tmp = df.withColumn("summed_val", expr("aggregate(val, 0, (acc, x) -> acc + x)"))

tmp.show()
+---+---------+----------+
| id|      val|summed_val|
+---+---------+----------+
|  1|[3, 4, 5]|        12|
|  2|   [1, 2]|         3|
+---+---------+----------+
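In plain Scala terms, the lambda passed to aggregate behaves like a foldLeft over each array. A minimal sketch using the sample data, outside of Spark:

```scala
// Each array is folded with the same (acc, x) -> acc + x step that
// aggregate applies per row; the per-row sums are then added up.
val rows = Seq(Seq(3, 4, 5), Seq(1, 2))
val rowSums = rows.map(_.foldLeft(0)((acc, x) => acc + x)) // Seq(12, 3)
val total = rowSums.sum                                    // 15
```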

// one-row DataFrame with the overall sum; collecting to a scalar value is possible too
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
|   15|
+-----+
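As mentioned in the comment above, the grand total can also be collected into a Scala value rather than shown. A minimal sketch, assuming the tmp DataFrame from above and an active SparkSession:

```scala
// sum over an integer column yields a LongType result,
// so read it back from the one-row aggregate with getLong.
val total: Long = tmp.agg(sum("summed_val")).first.getLong(0)
// total is 15 for the sample data (12 + 3)
```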

Another option is to use explode, which is also available on Spark 2.3. But beware: this approach generates one row per array element, so there is much more data to aggregate.

import org.apache.spark.sql.functions.{explode, sum}
import spark.implicits._ // for the $ column syntax

val tmp = df.withColumn("elem", explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()

Upvotes: 3
