Umer

Reputation: 25

Converting string type into rows in PySpark

I have a data frame with the following structure and values:

root
 |-- Row_number: integer (nullable = false)
 |-- Factor: string (nullable = false)
 |-- Country: string (nullable = false)
 |-- Date: date (nullable = false)
 |-- Amount: integer (nullable = false)

+----------+---------------------+
|Row_number|               Factor|
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+

I want to convert it into the following data frame:

1, EN2_1
1, EN2_2
1, EN3_3
2, EN2_1
2, EN2_2
2, EN3_3
3, EN2_1
3, EN2_2
3, EN3_3

I tried to read the column as ArrayType, but it gives an error.
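
For reference, a minimal sketch that reproduces the relevant part of the frame (assuming Factor really is a plain string, and leaving out the Country, Date and Amount columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Factor is a string that only looks like an array
df = spark.createDataFrame(
    [(1, "[EN2_1, EN2_2, EN3_3]"),
     (2, "[EN2_1, EN2_2, EN3_3]"),
     (3, "[EN2_1, EN2_2, EN3_3]")],
    ["Row_number", "Factor"],
)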

Upvotes: 0

Views: 354

Answers (2)

mck

Reputation: 42422

You can first remove the square brackets using trim(both '[]' from ...), split on ', ', then explode the resulting array into rows:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)

df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
|         1| EN2_1|
|         1| EN2_2|
|         1| EN3_3|
|         2| EN2_1|
|         2| EN2_2|
|         2| EN3_3|
|         3| EN2_1|
|         3| EN2_2|
|         3| EN3_3|
+----------+------+
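
If Factor were a true ArrayType column rather than a string, the trim and split steps would be unnecessary; a sketch under that assumption:

df2 = df.withColumn('Factor', F.explode('Factor'))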

Upvotes: 1

nobody

Reputation: 11090

A combination of split and explode should work:

import pyspark.sql.functions as F

# Strip the leading '[' and trailing ']', split on ', ',
# then explode the array into one row per element
df.withColumn(
    "New_Factor",
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
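
A usage sketch chaining a select to get the two-column output the asker wants (assuming the sample frame above; the select and show calls are my addition, not part of the original answer):

df.withColumn(
    "New_Factor",
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
).select("Row_number", "New_Factor").show()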

Upvotes: 2
