Padfoot123

Reputation: 1117

How to remove specific strings from a list in pyspark dataframe column

I have the Python list below:

lst=['name','age','country']

The Spark DataFrame is below:

column_a
name Xxxx, age 23, country aaaa
name yyyy, age 25, country bbbb

I need to compare the list against the string column of the Spark DataFrame and remove every list value that appears in the column.

Expected output is:

column_a
Xxxx, 23, aaaa
yyyy, 25, bbbb
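
For reference, the example data above can be rebuilt with something like this (a sketch; a local SparkSession named spark is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lst = ['name', 'age', 'country']

# The two example rows from above, in a single string column.
df = spark.createDataFrame(
    [('name Xxxx, age 23, country aaaa',),
     ('name yyyy, age 25, country bbbb',)],
    ['column_a'],
)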

Upvotes: 1

Views: 3534

Answers (2)

Rabinzel

Reputation: 7923

Just in case you don't want to import any additional modules, you can also use something like this:

# split each value, drop the tokens that appear in lst, and rejoin with spaces
df['column_a'] = df['column_a'].apply(lambda x: ' '.join([i for i in x.split() if i not in lst]))
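
Note that .apply here is the pandas DataFrame API; on a native PySpark DataFrame the same split-and-filter idea can be wrapped in a UDF, for example (a sketch, assuming lst and df from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Sketch: the same split-and-filter logic as a Spark UDF.
remove_words = F.udf(
    lambda s: ' '.join(t for t in s.split() if t not in lst) if s else s,
    StringType(),
)

df = df.withColumn('column_a', remove_words('column_a'))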

Upvotes: 2

sophocles

Reputation: 13841

You can use regexp_replace together with '|'.join(lst). The former replaces regex matches in a string column; the latter joins the elements of the list with |, producing the pattern name|age|country. Combining the two removes any parts of your column that appear in your list.

import pyspark.sql.functions as F

df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
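
Note that this removes only the words themselves and leaves the space that followed each of them. To match the expected output exactly, one option is a word-boundary pattern that also consumes the trailing whitespace (a sketch, not part of the original answer):

# Sketch: match whole words from lst and swallow the space after each match,
# turning "name Xxxx, age 23, country aaaa" into "Xxxx, 23, aaaa".
pattern = r'\b(' + '|'.join(lst) + r')\b\s*'

df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))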

Upvotes: 2
