Reputation: 107
I'm new to PySpark and trying to solve an ETL step.
I have the schema below. I would like to take the variable field that is inside the array and turn it into a column, but when I do this with explode I create duplicate rows, because each row's array holds several elements (positions [0], [1], [2], ...).
My goal is to collect every variable value across the array elements into a single new column, as one string with the values separated by commas.
root
|-- id: string (nullable = true)
|-- info: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- variable: string (nullable = true)
Output:
| id | new column |
|---|---|
| 123435e5x-9a9z | A, B, D |
| 555585a4Z-0B1Y | A |
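A minimal reproducible example of this data (a sketch built to match the schema and the expected output above, using a standard Spark session):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# each struct in the array is written as a one-element tuple
df = spark.createDataFrame(
    [
        ("123435e5x-9a9z", [("A",), ("B",), ("D",)]),
        ("555585a4Z-0B1Y", [("A",)]),
    ],
    "id string, info array<struct<variable:string>>",
)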
Thank you for the help.
Upvotes: 1
Views: 174
Reputation: 350
As mentioned by David Markovitz, you can use the concat_ws function. Note that info is an array of structs, so you first need to extract the variable field; dot notation (info.variable) does this across the whole array, yielding an array of strings:
from pyspark.sql import functions as F

# info.variable pulls the struct field out across the whole array,
# producing an array<string> that concat_ws joins into one string
df = df.withColumn('new column', F.concat_ws(', ', F.col('info.variable')))
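With the question's sample data this gives the expected result (output sketched by hand):
df.select('id', 'new column').show(truncate=False)
# +--------------+----------+
# |id            |new column|
# +--------------+----------+
# |123435e5x-9a9z|A, B, D   |
# |555585a4Z-0B1Y|A         |
# +--------------+----------+
An equivalent option is F.array_join(F.col('info.variable'), ', '), which also joins an array of strings with a separator.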
Upvotes: 1