Jorge Machado

Reputation: 772

How to explode columns?

After running:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

I have this DataFrame in Apache Spark:

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert this into:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+

Upvotes: 19

Views: 25684

Answers (5)

Habib Karbasian

Reputation: 676

If you are working with SparkR, you can find my answer here; it doesn't need explode, but it does need SparkR::dapply and stringr::str_split_fixed.

Upvotes: 0

Shane Halloran

Reputation: 328

Just to give the PySpark version of sgvd's answer: if the array column is Col2, then this select statement moves the first nElements of each array in Col2 into their own columns:

from pyspark.sql import functions as F

# Each getItem(i) pulls element i of the array into its own column
df.select([F.col('Col2').getItem(i) for i in range(nElements)])

Upvotes: 3

Yuan Zhao

Reputation: 479

Just to add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size($"Col2").as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)
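
For example, on a ragged frame like the one sgvd shows below, this picks the longest array. A small sketch, assuming the same imports plus spark.implicits._ in scope (the ragged name is just for illustration):

// Hypothetical ragged frame: the second array has one extra element
val ragged = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4, 5))).toDF("Col1", "Col2")

val nElements = ragged.select(size($"Col2").as("Col2_count"))
                      .select(max("Col2_count"))
                      .first.getInt(0)   // 4, whereas sampling only the first row gives 3

With ANSI mode off (the Spark default), indexing past the end of a shorter array with $"Col2"(idx) yields null rather than an error, so the extra columns are simply null-padded.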

Upvotes: 1

sgvd

Reputation: 3939

A solution that doesn't convert to and from RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")

Or, arguably nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx).as("Col" + (idx + 2)))): _*)

The size of a Spark array column is not fixed, you could for instance have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no way to know the number of columns up front and create them accordingly. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size

Upvotes: 21

Carlos Vilchez

Reputation: 2804

You can use a map:

import scala.collection.mutable
import org.apache.spark.sql.Row

// Pattern-match each Row; requires spark.implicits._ in scope for the tuple encoder
df.map {
    case Row(col1: Int, col2: mutable.WrappedArray[Int @unchecked]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()

Upvotes: 0
