Padfoot123

Reputation: 1117

How to remove specific strings from a list in pyspark dataframe column

I have the Python list below:

lst=['name','age','country']

The Spark DataFrame is below:

column_a
name Xxxx, age 23, country aaaa
name yyyy, age 25, country bbbb

I need to compare the list against the string column of the Spark DataFrame and remove every list value that appears in the column.

Expected output is:

column_a
Xxxx, 23, aaaa
yyyy, 25, bbbb
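
For reference, the example data above can be rebuilt with something like this (a sketch; a local SparkSession named spark is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lst = ['name', 'age', 'country']

# The two example rows from above, in a single string column.
df = spark.createDataFrame(
    [('name Xxxx, age 23, country aaaa',),
     ('name yyyy, age 25, country bbbb',)],
    ['column_a'],
)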

Upvotes: 1

Views: 3534

Answers (2)

Rabinzel

Reputation: 7923

Just in case you don't want to import any additional modules, you can also use something like this:

# split each value, drop the tokens that appear in lst, and rejoin with spaces
df['column_a'] = df['column_a'].apply(lambda x: ' '.join([i for i in x.split() if i not in lst]))
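
Note that .apply here is the pandas DataFrame API; on a native PySpark DataFrame the same split-and-filter idea can be wrapped in a UDF, for example (a sketch, assuming lst and df from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Sketch: the same split-and-filter logic as a Spark UDF.
remove_words = F.udf(
    lambda s: ' '.join(t for t in s.split() if t not in lst) if s else s,
    StringType(),
)

df = df.withColumn('column_a', remove_words('column_a'))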

Upvotes: 2

sophocles

Reputation: 13841

You can use regexp_replace together with '|'.join(lst). The former replaces regex matches in a string column; the latter joins the elements of the list with |, producing the pattern name|age|country. Combining the two removes any parts of your column that appear in your list.

import pyspark.sql.functions as F

df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
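
Note that this removes only the words themselves and leaves the space that followed each of them. To match the expected output exactly, one option is a word-boundary pattern that also consumes the trailing whitespace (a sketch, not part of the original answer):

# Sketch: match whole words from lst and swallow the space after each match,
# turning "name Xxxx, age 23, country aaaa" into "Xxxx, 23, aaaa".
pattern = r'\b(' + '|'.join(lst) + r')\b\s*'

df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))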

Upvotes: 2
