Whitewater

Reputation: 317

Filter a DataFrame on a column if a list value is contained in the column value (pandas)

I am familiar with filtering a dataframe if the value of a column can be found in a list, but I can't figure out how to implement the reverse case, iterating through values in a list to see if those values are contained in the column value.

For example, my df has a column with values like:

import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})


     column
0       abc
1       def
2       ghi
3  abc, def
4  ghi, jkl
5       abc

Note that all values are strings, including compound values like 'abc, def'.

I also have a list of two values I want to filter by, filter_list = ['abc', 'jkl'], so that I keep the rows where these values are found in the df column.

I want to filter the dataframe column so that a row is kept if any value in the list is contained in the column value, such that the final output in this case would be 'column' = ['abc', 'abc, def', 'ghi, jkl', 'abc']

     column
0       abc
1  abc, def
2  ghi, jkl
3       abc

I want to keep the rows that contain "abc" or "jkl".

My first thought was to use a list comprehension inside a lambda, but that throws an error:

dataframe = dataframe[dataframe['column'].apply(labmda x: f for f in filter_list in x])]

Note that I cannot use the .isin() method because that won't capture the compound strings like 'abc, def'.

Looking for thoughts on how to do this. I cannot use the re package; I only have access to pandas/numpy.

UPDATE:

Big thank you to Balaji! The working solution:

Keeping rows that contain a list value:

dataframe[pd.notna(dataframe['column']) & dataframe['column'].astype(str).str.contains('|'.join(filter_list))]

Filtering those rows out:

dataframe[pd.notna(dataframe['column']) & ~dataframe['column'].astype(str).str.contains('|'.join(filter_list))]
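For reference, the list-comprehension idea from my first attempt also works once it is wrapped in any(). This is a minimal sketch (assuming the column holds plain strings or NaN) that avoids building a regex pattern at all:

# Sketch: keep rows where any filter value is a substring of the cell value
mask = dataframe['column'].apply(
    lambda x: isinstance(x, str) and any(f in x for f in filter_list)
)
dataframe = dataframe[mask]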

Upvotes: 2

Views: 6266

Answers (2)

Balaji Ambresh

Reputation: 5012

Here you go:

import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']
contains_filter = '|'.join(filter_list)  # regex alternation: 'abc|jkl'
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]

Output:

     column
0       abc
3  abc, def
4  ghi, jkl
5       abc
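As a side note (not part of the original answer), str.contains accepts an na parameter, so the explicit pd.notna check can likely be folded into a single call. A minimal equivalent sketch:

# Equivalent, treating missing values as non-matches via na=False
df = df[df.column.str.contains(contains_filter, na=False)]

Missing values are then treated as non-matches rather than propagating NaN into the boolean mask.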

Upvotes: 4

Umar.H

Reputation: 23099

str.contains is the usual way to go, but let's look at an alternative method. We can use .stack and .isin if your string column is separated by commas.

filter_list = ['abc', 'jkl']

# Split each cell on commas, stack the pieces into a long Series, and strip whitespace
s = df['column'].str.split(',', expand=True).stack().str.strip()

# Keep the original rows whose split-out values appear in the filter list
df_filter = df.loc[s[s.isin(filter_list)].index.get_level_values(0)]

print(df_filter)

     column
0       abc
3  abc, def
4  ghi, jkl
5       abc
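One caveat worth adding (not part of the original answer): if a single cell matched more than one value in filter_list, its row label would appear multiple times in the stacked index, and df.loc would return that row more than once. A sketch that de-duplicates the matched index labels guards against this:

# Sketch: drop duplicate index labels before selecting rows
matched_idx = s[s.isin(filter_list)].index.get_level_values(0).unique()
df_filter = df.loc[matched_idx]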

Upvotes: 1
