pyspark - convert collected list to tuple

Question

My dataframe is given below:

+------------+---------------------+
| invoice_id | newcolor            |
+------------+---------------------+
|          1 | [red, white, green] |
|          2 | [red, green]        |
+------------+---------------------+

I need a new column containing the following:

[('red', 'color'), ('white', 'color'), ('green', 'color')]
[('red', 'color'), ('green', 'color')]

Ramesh Maharjan · Accepted Answer

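You can define a UDF as follows:

from pyspark.sql import functions as F
from pyspark.sql import types as T

def addColor(x):
    # Pair every color in the array with the literal string 'color'.
    return [[color, 'color'] for color in x]

# The function returns a list of lists, so the return type is an array of string arrays.
udfAddColor = F.udf(addColor, T.ArrayType(T.ArrayType(T.StringType())))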

Then apply it with .withColumn:

df.withColumn('newcolor', udfAddColor(df.newcolor)).show(truncate=False)

This should give the desired output:

+----------+----------------------------------------------+
|invoice_id|newcolor                                      |
+----------+----------------------------------------------+
|1         |[[red, color], [white, color], [green, color]]|
|2         |[[red, color], [green, color]]                |
+----------+----------------------------------------------+
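
The question asks for tuples, while the output above holds two-element arrays. Spark has no tuple type; the closest equivalent is a struct. The following is a minimal sketch of that variant, not part of the original answer: the names add_color and with_color_pairs and the struct field names value and label are hypothetical choices made for illustration, and the None check is an added assumption that the column may contain NULL arrays.

```python
def add_color(colors):
    """Pair every color name with the literal label 'color'."""
    # A NULL array reaches a Python UDF as None, so pass it through unchanged.
    if colors is None:
        return None
    return [(color, 'color') for color in colors]


def with_color_pairs(df, column='newcolor'):
    """Return df with `column` replaced by an array of (value, label) structs."""
    # Imported here so add_color can be checked without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    # Hypothetical field names; each Python tuple is mapped onto this struct.
    pair_type = T.StructType([
        T.StructField('value', T.StringType()),
        T.StructField('label', T.StringType()),
    ])
    add_color_udf = F.udf(add_color, T.ArrayType(pair_type))
    return df.withColumn(column, add_color_udf(F.col(column)))
```

Calling with_color_pairs(df) on the dataframe from the question should yield a newcolor column of type array<struct<value:string,label:string>>. The helper add_color is plain Python, so its logic can be verified on an ordinary list before it is wrapped in a UDF.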
