user2805885

Reputation: 1705

pyspark - convert collected list to tuple

My dataframe is given below:

+------------+---------------------+
| invoice_id | newcolor            |
+------------+---------------------+
|          1 | [red, white, green] |
|          2 | [red, green]        |
+------------+---------------------+

I need a new column that pairs each color with the string 'color':

[('red', 'color'), ('white', 'color'), ('green', 'color')]
[('red', 'color'), ('green', 'color')]

Upvotes: 0

Views: 1946

Answers (1)

Ramesh Maharjan

Reputation: 41987

You can define a UDF as

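from pyspark.sql import functions as F
from pyspark.sql import types as T

def addColor(x):
    # Pair each color in the input array with the literal string 'color'
    return [[color, 'color'] for color in x]

# addColor returns a list of lists, so the return type is an array of string arrays
udfAddColor = F.udf(addColor, T.ArrayType(T.ArrayType(T.StringType())))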

and then use it with .withColumn as

df.withColumn('newcolor', udfAddColor(df.newcolor)).show(truncate=False)

This should give you the desired output:

+----------+----------------------------------------------+
|invoice_id|newcolor                                      |
+----------+----------------------------------------------+
|1         |[[red, color], [white, color], [green, color]]|
|2         |[[red, color], [green, color]]                |
+----------+----------------------------------------------+
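The pairing logic can be checked in plain Python before it is wrapped in a UDF. The following is only a minimal sketch that runs without Spark: `add_color`, its `label` parameter, and the `rows` dictionary are hypothetical names introduced for illustration, and the sample data mirrors the two rows from the question. It returns tuples, as the question asks, rather than the nested lists used in the UDF above.

```python
def add_color(colors, label="color"):
    """Pair every entry of `colors` with a fixed label (hypothetical helper)."""
    return [(c, label) for c in colors]


# Sample data mirroring the two rows in the question: invoice_id -> newcolor
rows = {1: ["red", "white", "green"], 2: ["red", "green"]}

# Apply the helper to each row, keeping the invoice_id as the key
result = {invoice_id: add_color(colors) for invoice_id, colors in rows.items()}

for invoice_id, pairs in result.items():
    print(invoice_id, pairs)
# 1 [('red', 'color'), ('white', 'color'), ('green', 'color')]
# 2 [('red', 'color'), ('green', 'color')]
```

Once the function behaves as expected on plain lists, the same body can be passed to F.udf with a matching return type.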

Upvotes: 1
