Reputation: 25
This is a follow-up to the following post: Pandas dataframe select rows where a list-column contains any of a list of strings
I want to be able to select rows that contain the exact pair of strings from the selection list (where selection= ['cat', 'dog']).
starting df:
molecule species
0 a [dog]
1 b [horse, pig]
2 c [cat, dog]
3 d [cat, horse, pig]
4 e [chicken, pig]
df I want:
molecule species
2 c [cat, dog]
I tried the following and it returned only the columns labels.
df[pd.DataFrame(df.species.tolist()).isin(selection).all(1)]
Upvotes: 0
Views: 1684
Reputation: 56
This will find anything.
import pandas as pd
selection = ['cat', 'dog']
mols = pd.DataFrame({'molecule':['a','b','c','d','e'],'species':[['dog'],['horse','pig'],['cat','dog'],['cat','horse','pig'],['chicken','pig']]})
mols.loc[np.where(pd.Series([all(w in selection for w in mols.species.values[k]) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
If you want to find any rows that have at least the elements in the list (and could have others as well), use:
mols.loc[np.where(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
This is an interesting application of matrices as selectors. Use the transposed mols to multiply the vector of zeroes and ones that points which rows in mols fit your criteria:
mols.to_numpy().T.dot(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}))
Another (more readable) solution would be to assign, to mols, a column where the condition is True, map it to 0 and 1 and query mols where that column is equal to 1.
Upvotes: 0
Reputation: 5183
One way to do it:
df['joined'] = df.species.str.join(sep=',')
selection = ['cat,dog']
filtered = df.loc[df.joined.isin(selection)]
This won't find cases with different sorting (i.e. 'dog,cat'
or 'horse,cat,pig'
), but if that is not an issue then it works fine.
Upvotes: 1