3yakuya

Reputation: 2672

Spark SQL - Array of arrays to a single array

I have two columns with arrays of strings:

| ColA | ColB |
|------|------|
| ["a"]| ["b"]|

I wanted to create a single column containing the values from both arrays:

| ColAplusB |
|-----------|
|["a", "b"] |

I tried array(ColA, ColB), which left me with:

| ColAplusBnested |
|-----------------|
| [["a"], ["b"]]  |

How can I get the desired result (the array of arrays flattened into a single array of the values from the initial arrays)?

Upvotes: 2

Views: 3725

Answers (2)

Stu

Reputation: 41

As of Spark 2.4.0, the array_union function concatenates two arrays. Note that it deduplicates any values that exist in both arrays.
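A minimal sketch of this on the question's columns, assuming Spark 2.4+ and the df defined in the other answer:

import org.apache.spark.sql.functions.{array_union, col}

// array_union concatenates ColA and ColB and drops values present in both
val combined = df.select(array_union(col("ColA"), col("ColB")).as("ColAplusB"))
combined.show()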

If you want to combine multiple arrays that are broken out across rows rather than columns, I use a two-step process, sketched after the list:

  1. Use explode_outer to unnest the arrays.
  2. Use collect_set to aggregate the values into a single deduplicated array. If you do not wish to deduplicate your results, use collect_list instead.
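A minimal sketch of that two-step process, assuming a hypothetical DataFrame df with a grouping column id and an array column values spread across rows (both names are placeholders, not from the question):

import org.apache.spark.sql.functions.{col, collect_set, explode_outer}

val combined = df
  .select(col("id"), explode_outer(col("values")).as("value")) // step 1: unnest, keeping rows with null/empty arrays
  .groupBy("id")
  .agg(collect_set("value").as("combined"))                    // step 2: re-aggregate into one deduplicated array
combined.show()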

Upvotes: 0

matw

Reputation: 81

Let's suppose your data is like this:

val df = spark.sqlContext.createDataFrame(Seq(
  (Array("a"), Array("b")) 
)).toDF("ColA", "ColB")
df.printSchema()
df.show()

root
 |-- ColA: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ColB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+

The built-in Spark SQL functions don't appear to include a concatenation function for arrays (or sequences); I only see concat functions for strings. You can, however, create a simple user-defined function (UDF):

import org.apache.spark.sql.functions.udf

val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()

root
 |-- ColAplusB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|ColAplusB|
+---------+
|   [a, b]|
+---------+

Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:

val df = spark.sqlContext.createDataFrame(Seq(
  (Array("b", "a", "c"), Array("a", "b")) 
)).toDF("ColA", "ColB")

df.show()

+---------+------+
|     ColA|  ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+

val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
  (x ++ y).distinct.sorted
}

df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()

+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+

Upvotes: 2
