Split spark DF column of list into individual columns

Question

Interested in the scala spark implementation of this split-column-of-list-into-multiple-columns-in-the-same-pyspark-dataframe

Given this Dataframe:

|              X     |            Y|
+--------------------+-------------+
|                rent|[1,2,3......]|
|     is_rent_changed|[4,5,6......]|
|               phone|[7,8,9......]|

I want A new Dataframe with exploded values and mapped to my provided col names:

colNames = ['cat','dog','mouse'....]

|              Column|cat |dog |mouse |.......|
+--------------------+---|---|--------|-------|
|                rent|1  |2  |3       |.......|
|     is_rent_changed|4  |5  |6       |.......|
|               phone|7  |8  |9       |.......|

Tried:

val out = df.select(col("X"),explode($"Y"))

But its wrong format and i dont know how to map to my colNames list:

X              | Y |
---------------|---|
rent           |1  |
rent           |2  |
rent           |3  |
.              |.  |
.              |.  |
is_rent_changed|4  |
is_rent_changed|5  |

In the link above, the python solution was to use a list comprehension:

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

But it doesn't show how to use a provided column name list given the column names are just the index of the columns.

Vihit Shah · Accepted Answer

I have come up with a solution which is based on certain assumptions.

1. Each array row will have the same number of elements. 2. We have the column names in an Array[String] 3. We know the expected number of values in the Array to be split into columns.

var df1 = df

where df is your input DF. For eg:-

+---+---------+
| id|    value|
+---+---------+
|  1|[A, B, C]|
|  2|[D, E, F]|
+---+---------+

val columns = Array("cat","dog","sheep")

columns is the array of column names.

for(i<-1 to columns.length)
df1 = df1.withColumn(columns(i-1),'value.getItem(i-1))


+---+---------+---+---+-----+
| id|    value|cat|dog|sheep|
+---+---------+---+---+-----+
|  1|[A, B, C]|  A|  B|    C|
|  2|[D, E, F]|  D|  E|    F|
+---+---------+---+---+-----+

You can now drop the value column if you want. Hope this works for you!

Split spark DF column of list into individual columns

Answers (2)

Related Questions