Reputation: 1705
My dataframe is given below:
+----------+-------------------+
|invoice_id|newcolor           |
+----------+-------------------+
|1         |[red, white, green]|
|2         |[red, green]       |
+----------+-------------------+
I need a new column with the following values:
[('red', 'color'), ('white', 'color'), ('green', 'color')]
[('red', 'color'), ('green', 'color')]
Upvotes: 0
Views: 1946
Reputation: 41987
You can define a udf function as
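from pyspark.sql import functions as F
from pyspark.sql import types as T

def addColor(x):
    # Pair each color with the literal label 'color'
    return [[color, 'color'] for color in x]

# addColor returns a list of lists, so the return type is an array of string arrays
udfAddColor = F.udf(addColor, T.ArrayType(T.ArrayType(T.StringType())))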
and then use it with .withColumn as
df.withColumn('newcolor', udfAddColor(df.newcolor)).show(truncate=False)
You should have your desired output as
+----------+----------------------------------------------+
|invoice_id|newcolor |
+----------+----------------------------------------------+
|1 |[[red, color], [white, color], [green, color]]|
|2 |[[red, color], [green, color]] |
+----------+----------------------------------------------+
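If it helps, the pairing logic can be checked in plain Python before wrapping it in a udf. This is only a sketch: add_color and rows are illustrative names, not part of the Spark code above, and the sample data is copied from the question. Each element the function returns is itself a list, so the column value is a list of lists, which is worth keeping in mind when choosing the udf return type.

```python
def add_color(colors):
    """Pair each color with the literal label 'color'."""
    return [[color, 'color'] for color in colors]

# Sample rows copied from the question: (invoice_id, newcolor)
rows = [(1, ['red', 'white', 'green']), (2, ['red', 'green'])]

# Apply the function row by row, as the udf would on each array value
result = [(invoice_id, add_color(colors)) for invoice_id, colors in rows]

for invoice_id, pairs in result:
    print(invoice_id, pairs)
# 1 [['red', 'color'], ['white', 'color'], ['green', 'color']]
# 2 [['red', 'color'], ['green', 'color']]
```

Note that this runs without a Spark session, so it only confirms the shape of the values, not how Spark displays them.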
Upvotes: 1