jxn
jxn

Reputation: 8055

Split spark DF column of list into individual columns

Interested in the scala spark implementation of this split-column-of-list-into-multiple-columns-in-the-same-pyspark-dataframe

Given this Dataframe:

|              X     |            Y|
+--------------------+-------------+
|                rent|[1,2,3......]|
|     is_rent_changed|[4,5,6......]|
|               phone|[7,8,9......]|

I want A new Dataframe with exploded values and mapped to my provided col names:

colNames = ['cat','dog','mouse'....]

|              Column|cat |dog |mouse |.......|
+--------------------+---|---|--------|-------|
|                rent|1  |2  |3       |.......|
|     is_rent_changed|4  |5  |6       |.......|
|               phone|7  |8  |9       |.......|

Tried:

val out = df.select(col("X"),explode($"Y"))

But its wrong format and i dont know how to map to my colNames list:

X              | Y |
---------------|---|
rent           |1  |
rent           |2  |
rent           |3  |
.              |.  |
.              |.  |
is_rent_changed|4  |
is_rent_changed|5  |

In the link above, the python solution was to use a list comprehension:

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

But it doesn't show how to use a provided column name list given the column names are just the index of the columns.

Upvotes: 1

Views: 1865

Answers (2)

Krzysztof Atłasik
Krzysztof Atłasik

Reputation: 22635

Actually it's not as complicated.What dataframe's select expects, is just list of Column object and we can create that list programatically:

val columns = List("cat", "mouse", "dog")
   .zipWithIndex
   .map{ case (header, idx) => $"y"(idx).alias(header) }

Then only thing we need to do is just use it in select:

val df = List(("rent", Array(1,2,3)), ("is_rent_changed", Array(3,4,5))).toDF("x", "y")
df.select($"x" :: columns : _*).show()

Output:

+---------------+---+-----+---+
|              x|cat|mouse|dog|
+---------------+---+-----+---+
|           rent|  1|    2|  3|
|is_rent_changed|  3|    4|  5|
+---------------+---+-----+---+

Of course you could also inline it.

Upvotes: 1

Vihit Shah
Vihit Shah

Reputation: 314

I have come up with a solution which is based on certain assumptions.

1. Each array row will have the same number of elements. 2. We have the column names in an Array[String] 3. We know the expected number of values in the Array to be split into columns.
var df1 = df

where df is your input DF. For eg:-

+---+---------+
| id|    value|
+---+---------+
|  1|[A, B, C]|
|  2|[D, E, F]|
+---+---------+

val columns = Array("cat","dog","sheep")

columns is the array of column names.

for(i<-1 to columns.length)
df1 = df1.withColumn(columns(i-1),'value.getItem(i-1))


+---+---------+---+---+-----+
| id|    value|cat|dog|sheep|
+---+---------+---+---+-----+
|  1|[A, B, C]|  A|  B|    C|
|  2|[D, E, F]|  D|  E|    F|
+---+---------+---+---+-----+

You can now drop the value column if you want. Hope this works for you!

Upvotes: 1

Related Questions