Sky Monster

Reputation: 53

Pyspark - create a new column using startswith from list

What is the best way to add a new column based on a string-prefix check?

I need to create a new column that indicates whether the value in an existing column starts with any of a set of predefined strings:

|deliveryname|department|state|salary|
+-------------+----------+-----+------+
|          LA|     Sales|   NY| 90000|
|      Austin|     Sales|   NY| 86000|
|      Robert|     Sales|   CA| 81000|
|     Snooze |   Finance|   CA| 90000|
|     MidWest|   Finance|   NY| 83000|
|        Jeff| Marketing|   CA| 80000|

df= df.withColumn("DeliveryPossible",when(df.deliveryname.startswith(s) for s in (('LO - ','Austin','MidWest','San Antonios', 'Snooze ea')),'True').otherwise('False'))

or

values = ['LO - ','Austin','MidWest','San Antonios', 'Snooze ea']

df.withColumn("DeliveryPossible",when(df.company_name.startswith(s) for s in values ,'True').otherwise('False')).show()

Required OUTPUT:

|deliveryname|department|state|salary|DeliveryPossible|
+------------+----------+-----+------+----------------+
|          LA|     Sales|   NY| 90000|           False|
|      Austin|     Sales|   NY| 86000|            True|
|      Robert|     Sales|   CA| 81000|           False|
|     Snooze |   Finance|   CA| 90000|            True|
|     MidWest|   Finance|   NY| 83000|            True|
|        Jeff| Marketing|   CA| 80000|           False|

I get the same error with both versions. I suspect I'm missing parentheses, but I can't figure out where to put them. Also, is this the correct way of doing this?

Generator expression must be parenthesized if not sole argument.

Thanks

Upvotes: 1

Views: 1333

Answers (1)

mck

Reputation: 42422

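Column.startswith() only accepts a single string (or column) as its argument, so it cannot check against a list directly. The syntax error comes from passing a bare generator expression to when() together with a second argument, and adding parentheses alone would not help, because when() expects a single condition rather than a generator. You need to set up the conditions separately, one per prefix, and combine them using 'OR'.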

from functools import reduce
from operator import or_

values = ['LO - ','Austin','MidWest','San Antonios', 'Snooze ea']

df.withColumn("DeliveryPossible",
              reduce(or_, [df.company_name.startswith(s) for s in values])
             ).show()
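
To see what the reduce call above builds, here is a minimal sketch in plain Python that runs without Spark. The Cond class is a hypothetical stand-in for a column condition (it is not part of PySpark) and only records the expression as text. The point is that reduce(or_, [c1, c2, c3]) evaluates to (c1 | c2) | c3.

```python
from functools import reduce
from operator import or_


class Cond:
    """Hypothetical stand-in for a column condition; it only records the expression."""

    def __init__(self, expr):
        self.expr = expr

    def __or__(self, other):
        # Mirrors how `|` merges two conditions into a single one.
        return Cond(f"({self.expr} OR {other.expr})")


values = ['LO - ', 'Austin', 'MidWest', 'San Antonios', 'Snooze ea']

# One condition per prefix, like the list comprehension in the answer.
conditions = [Cond(f"startswith('{s}')") for s in values]

# Fold the list left to right with `|`.
combined = reduce(or_, conditions)
print(combined.expr)
```

Note that the column produced by the answer's code should hold booleans. If the string values 'True' and 'False' from the question are required, the combined condition can presumably be wrapped as when(condition, 'True').otherwise('False').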

Upvotes: 1
