Niklas Brauer

Reputation: 77

pyspark: create column based on string contained in another column

How can I reduce noise in a column by extracting a certain string with PySpark? See the table below. Instead of only two categories, additional text in the duration column breaks any grouping. The column duration1, created by the UDF below, is supposed to solve this, but the UDF only matches exact values. What is missing is a substring operator such as value.contains(), LIKE, or in.

duration|duration1|
Full day|Full day|
Full day x|other|
Half-day|Half day|
Half-day morning|other|

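from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Exact matches only: any extra text falls through to 'other'
def duration_simple(value):
   if   value == "Full day": return 'Full day'
   elif value == "Half-day": return 'Half day'
   else: return 'other'

udfduration_simple = udf(duration_simple, StringType())

new_df = old_df.withColumn("duration1", udfduration_simple("duration"))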

Upvotes: 2

Views: 1650

Answers (1)

zlidime

Reputation: 1224

You can use the like() method on the column, which works like the SQL LIKE operator. Chain it with when()/otherwise() instead of a UDF:

from pyspark.sql import functions as F
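new_df = old_df.select(
    old_df.duration,
    F.when(old_df.duration.like("%Full day%"), "Full day")
     .when(old_df.duration.like("%Half-day%"), "Half day")
     .otherwise("other")
     .alias("duration1"),
)
# show() returns None, so call it separately rather than assigning its result
new_df.show()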
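
If you would rather keep the UDF from the question, the missing piece is probably just Python's in operator, which does a substring test on plain strings. Below is a minimal sketch of that variant in plain Python. The None check is my own assumption about how you want nulls handled (a UDF receives None for null cells); here they map to 'other'.

```python
def duration_simple(value):
    """Collapse noisy duration strings into 'Full day', 'Half day' or 'other'."""
    # Assumption: null cells arrive as None and should count as 'other'
    if value is None:
        return 'other'
    # 'in' is a substring test, so "Full day x" still matches "Full day"
    if "Full day" in value:
        return 'Full day'
    if "Half-day" in value:
        return 'Half day'
    return 'other'


if __name__ == "__main__":
    for raw in ["Full day", "Full day x", "Half-day", "Half-day morning", "2 hours", None]:
        print(repr(raw), "->", duration_simple(raw))
```

This function can be wrapped with udf(duration_simple, StringType()) exactly as in the question. The when()/like() version above is usually the better choice, though, because it stays inside Spark's built-in column expressions and avoids the Python UDF overhead.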

Upvotes: 3
