Reputation: 2672
I have two columns with arrays of strings
| ColA | ColB |
|------|------|
| ["a"]| ["b"]|
I wanted to create a single column containing values from both of the arrays"
| ColAplusB |
|-----------|
|["a", "b"] |
I tried array(ColA, ColB)
which left me with:
| ColAplusBnested |
|-----------------|
| [["a"], ["b"]] |
How could I get the desired result (array of arrays conversed to an array of values from initial arrays)?
Upvotes: 2
Views: 3725
Reputation: 41
As of Spark 2.4.0, the array_union function has allowed for the concatenation of two arrays. Note that this will deduplicate any values that exist in both arrays.
If you want to combine multiple arrays together, with the arrays broken out across rows rather than columns, I use a two step process:
Upvotes: 0
Reputation: 81
Let's suppose your data is like this:
val df = spark.sqlContext.createDataFrame(Seq(
(Array("a"), Array("b"))
)).toDF("ColA", "ColB")
df.printSchema()
df.show()
root
|-- ColA: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ColB: array (nullable = true)
| |-- element: string (containsNull = true)
+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+
The existing set of Spark SQL functions doesn't appear to have a concatenation function for arrays (or sequences). I only see concat
functions for strings. But you can create a simple user-defined function (UDF):
import org.apache.spark.sql.functions.udf
val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()
root
|-- ColAplusB: array (nullable = true)
| |-- element: string (containsNull = true)
+---------+
|ColAplusB|
+---------+
| [a, b]|
+---------+
Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:
val df = spark.sqlContext.createDataFrame(Seq(
(Array("b", "a", "c"), Array("a", "b"))
)).toDF("ColA", "ColB")
df.show()
+---------+------+
| ColA| ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+
val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
(x ++ y).distinct.sorted
}
df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()
+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+
Upvotes: 2