Reputation: 25
I have a data frame with the following structure and values:
root
|-- Row_number: integer (nullable = false)
|-- Factor: string (nullable = false)
|-- Country: string (nullable = false)
|-- Date: date (nullable = false)
|-- Amount: integer (nullable = false)
+----------+---------------------+
|Row_number|               Factor|
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+
I want to convert it into the following data frame:
1, EN2_1
1, EN2_2
1, EN3_3
2, EN2_1
2, EN2_2
2, EN3_3
3, EN2_1
3, EN2_2
3, EN3_3
I tried to read the column as ArrayType, but it gives an error.
Upvotes: 0
Views: 354
Reputation: 42422
You can first remove the square brackets using trim both, split the string on ', ', and explode the resulting array into rows:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)
df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
| 1| EN2_1|
| 1| EN2_2|
| 1| EN3_3|
| 2| EN2_1|
| 2| EN2_2|
| 2| EN3_3|
| 3| EN2_1|
| 3| EN2_2|
| 3| EN3_3|
+----------+------+
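If you want to test this end to end, here is a minimal sketch that reproduces the input DataFrame (an assumption on my part: only the two relevant columns are included, with sample values taken from the question's show() output):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Factor is a plain string that merely looks like an array, which is why
# trim + split is needed before explode.
df = spark.createDataFrame(
    [(1, "[EN2_1, EN2_2, EN3_3]"),
     (2, "[EN2_1, EN2_2, EN3_3]"),
     (3, "[EN2_1, EN2_2, EN3_3]")],
    ["Row_number", "Factor"],
)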
Upvotes: 1
Reputation: 11090
A combination of split and explode should work:
import pyspark.sql.functions as F

df.withColumn(
    "New_Factor",
    # strip the leading '[' and trailing ']', then split on ', '
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
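Since the question mentions trying to read the column as ArrayType: split already returns an ArrayType(StringType) column, so you can stop one step before the explode if the array itself is what you need. A sketch (the column name Factor_array is hypothetical):

import pyspark.sql.functions as F

# Keep the parsed array instead of exploding it; Factor_array will appear
# as array<string> in the schema.
df_arr = df.withColumn(
    "Factor_array",
    F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", ")
)
df_arr.printSchema()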
Upvotes: 2