Reputation: 772
I have this DataFrame in Apache Spark:
val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
which looks like this:
+----+---------+
|Col1|     Col2|
+----+---------+
|   1|[2, 3, 4]|
|   1|[2, 3, 4]|
+----+---------+
How do I convert this into:
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|   1|   2|   3|   4|
|   1|   2|   3|   4|
+----+----+----+----+
Upvotes: 19
Views: 25684
Reputation: 676
If you are working with SparkR, you can find my answer here, where you don't need to use explode but you do need SparkR::dapply and stringr::str_split_fixed.
Upvotes: 0
Reputation: 328
Just to give the PySpark version of sgvd's answer: if the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:
from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
Upvotes: 3
Reputation: 479
Just to add on to sgvd's solution:
If the size is not always the same, you can set nElements like this:
import org.apache.spark.sql.functions.{max, size}
// Length of the longest array in Col2
val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)
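To show how this fits together with the select from sgvd's answer, here is a rough sketch, assuming Spark 2.x or later and a local SparkSession. The object name RaggedArrayToColumns and the helper spread are made up for illustration. What happens to rows whose array is shorter than nElements depends on your Spark version and settings: with ANSI mode off, an out-of-range index should give null, while with ANSI mode on it may raise an error instead, so check this on your setup. An empty DataFrame would also make getInt fail, because max returns null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size}

object RaggedArrayToColumns {
  // Hypothetical helper: the longest array in Col2 decides how many columns are created.
  def spread(df: DataFrame): DataFrame = {
    val nElements = df.select(max(size(col("Col2")))).first.getInt(0)
    // Col1 is kept; element idx of Col2 becomes column "Col<idx + 2>".
    val columns = col("Col1") +: (0 until nElements).map { idx =>
      col("Col2")(idx).as(s"Col${idx + 2}")
    }
    df.select(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder().master("local[*]").appName("ragged-arrays").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4, 5))).toDF("Col1", "Col2")
    spread(df).printSchema()
    spark.stop()
  }
}
```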
Upvotes: 1
Reputation: 3939
A solution that doesn't convert to and from RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")
Or, arguably nicer:
val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))
The size of a Spark array column is not fixed; you could, for instance, have:
+----+------------+
|Col1| Col2|
+----+------------+
| 1| [2, 3, 4]|
| 1|[2, 3, 4, 5]|
+----+------------+
So there is no general way to know the number of columns up front and create them. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size
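For reference, here is a minimal self-contained sketch of the approach above, assuming Spark 2.x or later and a local SparkSession. The object name ArrayToColumns and the helper toColumns are hypothetical; the column logic is the same as in the snippets above, only written with .as(...) and string interpolation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ArrayToColumns {
  // Hypothetical helper: spreads the first nElements entries of Col2 into Col2, Col3, ...
  def toColumns(df: DataFrame, nElements: Int): DataFrame = {
    val columns = col("Col1") +: (0 until nElements).map { idx =>
      col("Col2")(idx).as(s"Col${idx + 2}")
    }
    df.select(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder().master("local[*]").appName("array-to-columns").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

    // Array length taken from the first row; only valid if every row has the same length.
    val nElements = df.select("Col2").first.getList(0).size

    toColumns(df, nElements).show()
    spark.stop()
  }
}
```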
Upvotes: 21
Reputation: 2804
You can use a map:
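import org.apache.spark.sql.Row
import scala.collection.mutable
// Pattern-match each Row and emit a tuple; assumes every array has at least 3 elements
df.map {
  case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()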
Upvotes: 0