Element-wise sum of arrays across multiple columns of a data frame in Spark / Scala?

Question

I have a DataFrame that can have multiple columns of Array type, such as "Array1", "Array2", etc. These array columns all have the same number of elements. I need to compute a new column of Array type that holds the element-wise sum of the arrays. How can I do it?

Spark version = 2.3

For example:

Input:

|Column1| ... |ArrayColumn2|ArrayColumn3|
|-------| --- |------------|------------|
|T1     | ... |[1, 2, 3]   |[2, 5, 7]   |

Output:

|Column1| ... |AggregatedColumn|
|-------| --- |----------------|
|T1     | ... |[3, 7, 10]      |

The number of Array columns is not fixed, so I need a generalized solution. I will have a list of the columns to aggregate.

Thanks!

Leo C · Accepted Answer

Consider zipping the Array-typed columns with arrays_zip and flattening the result with inline, so that each row holds the elements at one array position. The higher-order function aggregate (available in Spark 2.4+) can then sum those elements, and a groupBy/agg collects the per-position sums back into an array:

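import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (101, Seq(1, 2), Seq(3, 4), Seq(5, 6)),
  (202, Seq(7, 8), Seq(9, 10), Seq(11, 12))
).toDF("id", "arr1", "arr2", "arr3")

// Pick up every array column by its name prefix
val arrCols = df.columns.filter(_.startsWith("arr")).map(col)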

For Spark 3.0+

df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", aggregate(array(arrCols: _*), lit(0), (acc, x) => acc + x).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
// +---+------------+
// | id|arr_elem_sum|
// +---+------------+
// |101|     [9, 12]|
// |202|    [27, 30]|
// +---+------------+

For Spark 2.4+

df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", array(arrCols: _*).as("arr_pos_elems")).
  select($"id", expr("aggregate(arr_pos_elems, 0, (acc, x) -> acc + x)").as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
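
The two variants above flatten the arrays and regroup them with groupBy and collect_list, which requires id to be unique per row and relies on the order in which collect_list gathers rows after the shuffle. As a possible alternative, here is a minimal sketch that keeps everything inside the row by folding the columns with the SQL higher-order function zip_with (available in Spark 2.4+). The names arrNames, sumExpr and zipped are illustrative only, the local SparkSession is there just to make the snippet self-contained, and it assumes there is at least one array column whose name needs no backtick quoting:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("zip-with-sum").getOrCreate()
import spark.implicits._

val df = Seq(
  (101, Seq(1, 2), Seq(3, 4), Seq(5, 6)),
  (202, Seq(7, 8), Seq(9, 10), Seq(11, 12))
).toDF("id", "arr1", "arr2", "arr3")

// Column names (not Column objects), because the SQL expression is built as a string.
val arrNames = df.columns.filter(_.startsWith("arr"))

// Fold the columns pairwise into nested zip_with calls, e.g.
// zip_with(zip_with(arr1, arr2, (x, y) -> x + y), arr3, (x, y) -> x + y)
val sumExpr = arrNames.reduce((a, b) => s"zip_with($a, $b, (x, y) -> x + y)")

// No explode and no groupBy: each row is summed independently.
val zipped = df.select($"id", expr(sumExpr).as("arr_elem_sum"))
zipped.show()
```

With the sample data this should give [9, 12] for id 101 and [27, 30] for id 202, the same values as the output shown earlier.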

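For Spark 2.3 or below, aggregate is not available, so a UDF can do the summing. Note that arrays_zip was also added in Spark 2.4, so the snippet below still needs 2.4 as written; on 2.3 the zipping step would have to move into a UDF as well.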

val sumArrElems = udf{ (arr: Seq[Int]) => arr.sum }

df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", sumArrElems(array(arrCols: _*)).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
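
For a cluster that really is on Spark 2.3, where arrays_zip is not available, one option is to do the whole element-wise sum inside a single UDF. The following is a sketch of that swapped-in approach, not part of the original answer: it packs the array columns into one array-of-arrays column with array, then transposes and sums in plain Scala. The names sumElementwise, sumElementwiseUdf and result are hypothetical, and the code assumes Int elements, equal array lengths within a row, and no null arrays:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("udf-elementwise-sum").getOrCreate()
import spark.implicits._

val df = Seq(
  (101, Seq(1, 2), Seq(3, 4), Seq(5, 6)),
  (202, Seq(7, 8), Seq(9, 10), Seq(11, 12))
).toDF("id", "arr1", "arr2", "arr3")

val arrCols = df.columns.filter(_.startsWith("arr")).map(col)

// Transpose the per-column arrays into per-position groups, then sum each group.
// Assumption: all inner arrays have the same length (transpose throws otherwise)
// and none of them is null.
val sumElementwise: Seq[Seq[Int]] => Seq[Int] = _.transpose.map(_.sum)

val sumElementwiseUdf = udf(sumElementwise)

// array(...) packs the array columns into a single array<array<int>> column for the UDF.
val result = df.select($"id", sumElementwiseUdf(array(arrCols: _*)).as("arr_elem_sum"))
result.show()
```

Because each row is processed on its own, this version needs no groupBy and does not depend on id being unique. A UDF is opaque to the optimizer, so on Spark 2.4+ the built-in functions are usually the better choice.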
