Reputation: 45
Input:

Column1   Column2     Column3
(1,2)     (xyz,abc)   (123,456)

Expected output:

Column1   Column2   Column3
1         xyz       123
2         abc       456

I need to split the data in the DataFrame so that the first element of every column forms one row, the second element forms the next row, and so on.
Upvotes: 0
Views: 91
Reputation: 6907
If you are using a recent version of Spark (2.4 or later), arrays_zip
will help you do what you want:
// define test dataset
val df = spark.createDataset(List(("(1,2)","(xyz,abc)","(123,456)")))
.toDF("Column1","Column2","Column3")
df.show
+-------+---------+---------+
|Column1| Column2| Column3|
+-------+---------+---------+
| (1,2)|(xyz,abc)|(123,456)|
+-------+---------+---------+
With this dataset, you can split all delimited text values into arrays:
val reshape_cols = df.columns
.map(c => split(regexp_replace(col(c),"[()]",""),",").as(c))
val reshaped_df = df.select(reshape_cols:_*)
reshaped_df.show
+-------+----------+----------+
|Column1| Column2| Column3|
+-------+----------+----------+
| [1, 2]|[xyz, abc]|[123, 456]|
+-------+----------+----------+
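The regexp_replace + split pair first strips the parentheses, then splits on commas. As a plain-Scala sketch of the same string transformation (outside Spark, just to show what each cell goes through):

```scala
// Illustrates regexp_replace(col, "[()]", "") followed by split(..., ",")
// on a single cell value; plain Scala, no Spark required.
val raw = "(xyz,abc)"
val arr: Array[String] = raw.replaceAll("[()]", "").split(",")
// arr.toSeq == Seq("xyz", "abc")
```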
Now that you have arrays, you can use arrays_zip
to generate a single column of type array of struct:
val zipped_df = reshaped_df
.select(arrays_zip(reshaped_df.columns.map(col):_*).as("value"))
zipped_df.show(false)
+------------------------------+
|value |
+------------------------------+
|[[1, xyz, 123], [2, abc, 456]]|
+------------------------------+
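Conceptually, arrays_zip transposes the per-column arrays into per-row tuples. A minimal plain-Scala sketch of that reshaping, using the values from this example:

```scala
// Each inner Seq is one column's array after the split step.
val cols = Seq(Seq("1", "2"), Seq("xyz", "abc"), Seq("123", "456"))
// transpose regroups element i of every column into row i,
// mirroring what arrays_zip does to produce the array of structs.
val rows = cols.transpose
// rows == Seq(Seq("1", "xyz", "123"), Seq("2", "abc", "456"))
```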
Now that you have an array of struct, you can use explode to transform your single row into multiple rows:
val final_df = zipped_df
.select(explode('value).as("s"))
.select(df.columns.map(c => 's(c).as(c)):_*)
final_df.show
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 1| xyz| 123|
| 2| abc| 456|
+-------+-------+-------+
Upvotes: 1