3yakuya

Reputation: 2672

Spark SQL - Array of arrays to a single array

I have two columns with arrays of strings:

| ColA | ColB |
|------|------|
| ["a"]| ["b"]|

I wanted to create a single column containing the values from both arrays:

| ColAplusB |
|-----------|
|["a", "b"] |

I tried array(ColA, ColB), which left me with:

| ColAplusBnested |
|-----------------|
| [["a"], ["b"]]  |

How can I get the desired result (the array of arrays flattened into a single array of the values from the initial arrays)?

Upvotes: 2

Views: 3725

Answers (2)

Stu

Reputation: 41

As of Spark 2.4.0, the array_union function concatenates two arrays. Note that it deduplicates any values that exist in both arrays.
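A minimal sketch of this on the question's columns, assuming Spark 2.4+ and the df defined in the other answer:

import org.apache.spark.sql.functions.{array_union, col}

// array_union concatenates ColA and ColB and drops values present in both
val combined = df.select(array_union(col("ColA"), col("ColB")).as("ColAplusB"))
combined.show()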

If you want to combine multiple arrays that are broken out across rows rather than columns, I use a two-step process, sketched after the list:

  1. Use explode_outer to unnest the arrays.
  2. Use collect_set to aggregate the values into a single deduplicated array. If you do not wish to deduplicate your results, use collect_list instead.
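A minimal sketch of that two-step process, assuming a hypothetical DataFrame df with a grouping column id and an array column values spread across rows (both names are placeholders, not from the question):

import org.apache.spark.sql.functions.{col, collect_set, explode_outer}

val combined = df
  .select(col("id"), explode_outer(col("values")).as("value")) // step 1: unnest, keeping rows with null/empty arrays
  .groupBy("id")
  .agg(collect_set("value").as("combined"))                    // step 2: re-aggregate into one deduplicated array
combined.show()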

Upvotes: 0

matw

Reputation: 81

Let's suppose your data is like this:

val df = spark.sqlContext.createDataFrame(Seq(
  (Array("a"), Array("b")) 
)).toDF("ColA", "ColB")
df.printSchema()
df.show()

root
 |-- ColA: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ColB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+

The built-in Spark SQL functions don't appear to include a concatenation function for arrays (or sequences); I only see concat functions for strings. You can, however, create a simple user-defined function (UDF):

import org.apache.spark.sql.functions.udf

val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()

root
 |-- ColAplusB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|ColAplusB|
+---------+
|   [a, b]|
+---------+

Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:

val df = spark.sqlContext.createDataFrame(Seq(
  (Array("b", "a", "c"), Array("a", "b")) 
)).toDF("ColA", "ColB")

df.show()

+---------+------+
|     ColA|  ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+

val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
  (x ++ y).distinct.sorted
}

df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()

+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+

Upvotes: 2
