Reputation: 1117
I have the below python list.
lst = ['name', 'age', 'country']
Spark dataframe is below.
column_a
name Xxxx, age 23, country aaaa
name yyyy, age 25, country bbbb
I need to compare the list against the string column of the Spark dataframe and remove from the column any words that appear in the list.
Expected output is:
column_a
Xxxx, 23, aaaa
yyyy, 25, bbbb
Upvotes: 1
Views: 3534
Reputation: 7923
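Just in case you don't want to import any additional modules and your data is in a pandas DataFrame (for example after df.toPandas()), you can also use something like this. Note that this is pandas syntax and will not run on a Spark DataFrame directly. Joining with a space keeps the remaining tokens separated:
df['column_a'] = df['column_a'].apply(lambda x: ' '.join([i for i in x.split() if i not in lst]))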
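The token-filtering idea in the line above can be sketched as a plain function and, as one possible option, wrapped in a UDF for a Spark column. The names drop_tokens and with_tokens_dropped are hypothetical helpers introduced for illustration, and the sample string is taken from the question. This is a minimal sketch that only removes tokens matching a list entry exactly, so a token like "name," (with trailing punctuation) would be kept.

```python
def drop_tokens(text, words):
    """Remove whitespace-separated tokens that exactly match an entry in `words`."""
    blocked = set(words)
    # Join with a single space so the kept tokens stay separated.
    return " ".join(tok for tok in text.split() if tok not in blocked)


def with_tokens_dropped(df, column, words):
    """Apply drop_tokens to a Spark string column through a UDF (hypothetical helper)."""
    # Imported lazily so the plain-Python part runs without PySpark installed.
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    clean = F.udf(
        lambda s: None if s is None else drop_tokens(s, words),
        StringType(),
    )
    return df.withColumn(column, clean(F.col(column)))


lst = ["name", "age", "country"]
cleaned = drop_tokens("name Xxxx, age 23, country aaaa", lst)
print(cleaned)
```

A Python UDF is generally slower than a built-in Spark function, so this is mainly useful when the filtering logic cannot be expressed as a regular expression.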
Upvotes: 2
Reputation: 13841
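You can use regexp_replace with '|'.join(). The first is commonly used to replace substring matches. The latter joins the elements of the list with |, which gives a regular expression that matches any one of them. The combination of the two removes every part of your column that matches an entry in your list. Only the matched words themselves are removed, so the space that followed each word stays in the column.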
import pyspark.sql.functions as F
df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
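The plain '|'.join(lst) pattern leaves the space after each removed word in place and would also match a list word inside a longer word. One possible refinement, sketched here under the assumption that the list entries are whole words, adds word boundaries and consumes trailing whitespace. The helper names build_pattern and strip_words are hypothetical. The pattern is checked below with Python's re module; the same pattern string should also be usable with F.regexp_replace, since \b, \s and (?:...) are supported by the Java regex engine that Spark uses, though re.escape targets Python's syntax and is worth double-checking for unusual characters.

```python
import re


def build_pattern(words):
    """Build a regex that matches any listed word plus the whitespace after it."""
    alternation = "|".join(re.escape(w) for w in words)
    # \b keeps 'age' from matching inside e.g. 'page'; \s* eats the trailing space.
    return r"\b(?:" + alternation + r")\b\s*"


def strip_words(df, column, words):
    """Remove the listed words from a Spark string column (hypothetical helper)."""
    # Imported lazily so the pattern can be tested without PySpark installed.
    import pyspark.sql.functions as F

    return df.withColumn(column, F.regexp_replace(column, build_pattern(words), ""))


lst = ["name", "age", "country"]
pattern = build_pattern(lst)
cleaned = re.sub(pattern, "", "name Xxxx, age 23, country aaaa")
print(cleaned)
```

With a Spark dataframe this would be called as df = strip_words(df, 'column_a', lst).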
Upvotes: 2