Reputation: 317
I am familiar with filtering a dataframe when a column's value can be found in a list, but I can't figure out the reverse case: iterating through the values in a list and checking whether each one is contained within the column's value.
For example, my df has a column with values like:
df = pd.DataFrame({'column':['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
column
0 abc
1 def
2 ghi
3 abc, def
4 ghi, jkl
5 abc
Note that all values are strings, including values like 'abc, def'
I also have a list of two values I want to filter by: filter_list = ['abc', 'jkl']
so that I keep these values where they are found in the df column.
I want to keep a row if any value from the list is contained in the column's string, such that the final output in this case would be 'column' = ['abc', 'abc, def', 'ghi, jkl', 'abc']
column
0 abc
1 abc, def
2 ghi, jkl
3 abc
That is, I want to keep the rows that contain "abc" or "jkl".
My first thought was a list comprehension inside a lambda, but this attempt throws a SyntaxError:
dataframe = dataframe[dataframe['column'].apply(lambda x: f for f in filter_list in x)]
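For reference, a comprehension like the one above can be made syntactically valid by wrapping the substring tests in any() -- a minimal sketch of the intended logic, using the example df and filter_list from this question:

```python
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']

# keep a row if any filter value is a substring of the cell
mask = df['column'].apply(lambda x: any(f in x for f in filter_list))
print(df[mask])
```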
Note that I cannot use the .isin() method, because it won't capture compound strings like 'abc, def'.
Looking for thoughts on how to do this. I cannot use the re package; I only have access to pandas/numpy.
UPDATE:
Big thank you to Balaji! The working solution:
keeping list values:
dataframe[pd.notna(dataframe[column]) & dataframe[column].astype(str).str.contains('|'.join(filter_list))]
filtering out:
dataframe[pd.notna(dataframe[column]) & ~dataframe[column].astype(str).str.contains('|'.join(filter_list))]
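One caveat worth noting about the solution above: str.contains treats '|'.join(filter_list) as a regular expression, so filter values containing regex metacharacters (e.g. '.', '+') would match incorrectly. Since re is off the table here, a sketch of a literal-match equivalent using only pandas/numpy builds one regex=False mask per filter value and combines them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']

# one boolean mask per filter value, each matched as a literal (no regex)
masks = [df['column'].str.contains(f, regex=False) for f in filter_list]
keep = df[np.logical_or.reduce(masks)]    # rows containing any filter value
drop = df[~np.logical_or.reduce(masks)]   # rows containing none of them
print(keep)
```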
Upvotes: 2
Views: 6266
Reputation: 5012
Here you go:
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']
contains_filter = '|'.join(filter_list)
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]
Output:
column
0 abc
3 abc, def
4 ghi, jkl
5 abc
Upvotes: 4
Reputation: 23099
str.contains
is the usual way to go, but let's look at an alternate method. We can use .stack
and isin
if your string column is separated by commas.
filter_list = ['abc', 'jkl']
s = df['column'].str.split(',', expand=True).stack().str.strip()
df_filter = df.loc[s[s.isin(filter_list)].index.get_level_values(0)]
print(df_filter)
column
0 abc
3 abc, def
4 ghi, jkl
5 abc
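To see why this works, it helps to inspect the intermediate Series: split/stack explodes each comma-separated cell into one row per token in a MultiIndexed Series, where index level 0 is the original row label. A small sketch, assuming the same df and filter_list as above:

```python
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']

# one row per comma-separated token; index level 0 is the original row label
s = df['column'].str.split(',', expand=True).stack().str.strip()
print(s)
# isin now performs exact matches on whole tokens, not substring matches;
# level 0 of the matching index maps the hits back to the original rows
hits = s[s.isin(filter_list)].index.get_level_values(0)
print(df.loc[hits])
```

Because isin compares whole tokens, this approach avoids a pitfall of substring matching: a filter value like 'ab' would not match the token 'abc'.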
Upvotes: 1