Reputation: 25
I have a data frame with the following structure and values:
root
|-- Row_number: integer (nullable = false)
|-- Factor: string (nullable = false)
|-- Country: string (nullable = false)
|-- Date: date (nullable = false)
|-- Amount: integer (nullable = false)
+----------+---------------------+
|Row_number|               Factor|
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+
I want to convert it into the following data frame:
1, EN2_1
1, EN2_2
1, EN3_3
2, EN2_1
2, EN2_2
2, EN3_3
3, EN2_1
3, EN2_2
3, EN3_3
I tried to read the column as ArrayType, but it gives an error.
Upvotes: 0
Views: 354
Reputation: 42422
You can first remove the square brackets using trim both, split the string on ', ', and explode the resulting array into rows:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)
df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
| 1| EN2_1|
| 1| EN2_2|
| 1| EN3_3|
| 2| EN2_1|
| 2| EN2_2|
| 2| EN3_3|
| 3| EN2_1|
| 3| EN2_2|
| 3| EN3_3|
+----------+------+
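If you want to test this end to end, here is a minimal sketch that reproduces the input DataFrame (an assumption on my part: only the two relevant columns are included, with sample values taken from the question's show() output):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Factor is a plain string that merely looks like an array, which is why
# trim + split is needed before explode.
df = spark.createDataFrame(
    [(1, "[EN2_1, EN2_2, EN3_3]"),
     (2, "[EN2_1, EN2_2, EN3_3]"),
     (3, "[EN2_1, EN2_2, EN3_3]")],
    ["Row_number", "Factor"],
)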
Upvotes: 1
Reputation: 11090
A combination of split and explode should work:
import pyspark.sql.functions as F

df.withColumn(
    "New_Factor",
    # strip the leading '[' and trailing ']', then split on ', '
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
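Since the question mentions trying to read the column as ArrayType: split already returns an ArrayType(StringType) column, so you can stop one step before the explode if the array itself is what you need. A sketch (the column name Factor_array is hypothetical):

import pyspark.sql.functions as F

# Keep the parsed array instead of exploding it; Factor_array will appear
# as array<string> in the schema.
df_arr = df.withColumn(
    "Factor_array",
    F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", ")
)
df_arr.printSchema()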
Upvotes: 2