Reputation: 8055
Interested in the scala spark implementation of this split-column-of-list-into-multiple-columns-in-the-same-pyspark-dataframe
Given this Dataframe:
| X | Y|
+--------------------+-------------+
| rent|[1,2,3......]|
| is_rent_changed|[4,5,6......]|
| phone|[7,8,9......]|
I want A new Dataframe with exploded values and mapped to my provided col names:
colNames = ['cat','dog','mouse'....]
| Column|cat |dog |mouse |.......|
+--------------------+---|---|--------|-------|
| rent|1 |2 |3 |.......|
| is_rent_changed|4 |5 |6 |.......|
| phone|7 |8 |9 |.......|
Tried:
val out = df.select(col("X"),explode($"Y"))
But its wrong format and i dont know how to map to my colNames list:
X | Y |
---------------|---|
rent |1 |
rent |2 |
rent |3 |
. |. |
. |. |
is_rent_changed|4 |
is_rent_changed|5 |
In the link above, the python solution was to use a list comprehension:
univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])
But it doesn't show how to use a provided column name list given the column names are just the index of the columns.
Upvotes: 1
Views: 1865
Reputation: 22635
Actually it's not as complicated.What dataframe's select
expects, is just list of Column
object and we can create that list programatically:
val columns = List("cat", "mouse", "dog")
.zipWithIndex
.map{ case (header, idx) => $"y"(idx).alias(header) }
Then only thing we need to do is just use it in select
:
val df = List(("rent", Array(1,2,3)), ("is_rent_changed", Array(3,4,5))).toDF("x", "y")
df.select($"x" :: columns : _*).show()
Output:
+---------------+---+-----+---+
| x|cat|mouse|dog|
+---------------+---+-----+---+
| rent| 1| 2| 3|
|is_rent_changed| 3| 4| 5|
+---------------+---+-----+---+
Of course you could also inline it.
Upvotes: 1
Reputation: 314
I have come up with a solution which is based on certain assumptions.
1. Each array row will have the same number of elements. 2. We have the column names in an Array[String] 3. We know the expected number of values in the Array to be split into columns.var df1 = df
where df
is your input DF. For eg:-
+---+---------+
| id| value|
+---+---------+
| 1|[A, B, C]|
| 2|[D, E, F]|
+---+---------+
val columns = Array("cat","dog","sheep")
columns
is the array of column names.
for(i<-1 to columns.length)
df1 = df1.withColumn(columns(i-1),'value.getItem(i-1))
+---+---------+---+---+-----+
| id| value|cat|dog|sheep|
+---+---------+---+---+-----+
| 1|[A, B, C]| A| B| C|
| 2|[D, E, F]| D| E| F|
+---+---------+---+---+-----+
You can now drop the value column if you want. Hope this works for you!
Upvotes: 1