Reputation: 772

How to explode columns?

After:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

I have this DataFrame in Apache Spark:

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert this into:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+

Upvotes: 19

Answers (5)

Habib Karbasian

Reputation: 676

If you are working with SparkR, you can find my answer here where you don't need to use explode but you need SparkR::dapply and stringr::str_split_fixed.

Upvotes: 0

Shane Halloran

Reputation: 328

Just to give the Pyspark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:

from pyspark.sql import functions as F            
df.select([F.col('Col2').getItem(i) for i in range(nElements)])

Upvotes: 3

Yuan Zhao

Reputation: 479

Just add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

val nElements = df.select(size('Col2).as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)

Upvotes: 1

sgvd

Reputation: 3939

A solution that doesn't convert to and from RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")

Or arguable nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))

The size of a Spark array column is not fixed, you could for instance have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no way to get the amount of columns and create those. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size

Upvotes: 21

Carlos Vilchez

Reputation: 2804

You can use a map:

df.map {
    case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()

Upvotes: 0

How to explode columns?

Answers (5)

Related Questions