Reputation: 45
Input:

Column1   Column2     Column3
(1,2)     (xyz,abc)   (123,456)

Expected output:

Column1   Column2   Column3
1         xyz       123
2         abc       456

I need to split the data in the DataFrame so that the first element of every column forms one row, the second element forms the next row, and so on.
Upvotes: 0
Views: 91
Reputation: 6907
If you are using a recent version of Spark (2.4 or later), arrays_zip
will help you do what you want:
// define test dataset
val df = spark.createDataset(List(("(1,2)","(xyz,abc)","(123,456)")))
.toDF("Column1","Column2","Column3")
df.show
+-------+---------+---------+
|Column1| Column2| Column3|
+-------+---------+---------+
| (1,2)|(xyz,abc)|(123,456)|
+-------+---------+---------+
With this dataset, you can split all delimited text values into arrays:
val reshape_cols = df.columns
.map(c => split(regexp_replace(col(c),"[()]",""),",").as(c))
val reshaped_df = df.select(reshape_cols:_*)
reshaped_df.show
+-------+----------+----------+
|Column1| Column2| Column3|
+-------+----------+----------+
| [1, 2]|[xyz, abc]|[123, 456]|
+-------+----------+----------+
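The regexp_replace + split pair first strips the parentheses, then splits on commas. As a plain-Scala sketch of the same string transformation (outside Spark, just to show what each cell goes through):

```scala
// Illustrates regexp_replace(col, "[()]", "") followed by split(..., ",")
// on a single cell value; plain Scala, no Spark required.
val raw = "(xyz,abc)"
val arr: Array[String] = raw.replaceAll("[()]", "").split(",")
// arr.toSeq == Seq("xyz", "abc")
```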
Now that you have arrays, you can use arrays_zip
to generate a single column of type array of struct:
val zipped_df = reshaped_df
.select(arrays_zip(reshaped_df.columns.map(col):_*).as("value"))
zipped_df.show(false)
+------------------------------+
|value |
+------------------------------+
|[[1, xyz, 123], [2, abc, 456]]|
+------------------------------+
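Conceptually, arrays_zip transposes the per-column arrays into per-row tuples. A minimal plain-Scala sketch of that reshaping, using the values from this example:

```scala
// Each inner Seq is one column's array after the split step.
val cols = Seq(Seq("1", "2"), Seq("xyz", "abc"), Seq("123", "456"))
// transpose regroups element i of every column into row i,
// mirroring what arrays_zip does to produce the array of structs.
val rows = cols.transpose
// rows == Seq(Seq("1", "xyz", "123"), Seq("2", "abc", "456"))
```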
Now that you have an array of struct, you can use explode to transform your single row into multiple rows:
val final_df = zipped_df
.select(explode('value).as("s"))
.select(df.columns.map(c => 's(c).as(c)):_*)
final_df.show
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 1| xyz| 123|
| 2| abc| 456|
+-------+-------+-------+
Upvotes: 1