rpb
rpb

Reputation: 3299

Unable to filter pandas dataframe based on set of substrings

The objective is to find whether each cell in a column contain either one of the string from a set of strings.

Given a df below;

df = pd.DataFrame({'Date': ['1 December 2020', 'May 2020', 'Random value', 'may 2019']})

and a flag sub string of

stk_list = ['December','May']

It should return

True; True; False; True

However, using the code below

stk_list = ['December','May']
df = pd.DataFrame({'Date': ['1 December 2020', 'May 2020', 'Random value', 'may 2019']})
dfxx=df['Date'].isin(stk_list)

it return

False; False;False;False

May I know where did I do wrong?

Upvotes: 1

Views: 41

Answers (2)

mmdanziger
mmdanziger

Reputation: 4658

You can use apply for this:

dfxx=df['Date'].apply(lambda x: any([i in x for i in stk_list])

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150735

You can use regex for this purpose:

pattern = '|'.join(s.lower() for s in stk_list)

df.Date.str.lower().str.contains(pattern)

You may want to modify pattern so that it only checks for the whole words:

pattern = r"\b({})\b".format('|'.join(s.lower() for s in stk_list))

Here \b denotes word boudary, so this pattern would not flag maybe as True.

Upvotes: 1

Related Questions