Coder
Coder

Reputation: 25

Pandas dataframe select rows where a list-column contains a specific set of elements

This is a follow-up to the following post: Pandas dataframe select rows where a list-column contains any of a list of strings

I want to be able to select rows that contain the exact pair of strings from the selection list (where selection= ['cat', 'dog']).

starting df:

  molecule            species
0        a              [dog]
1        b       [horse, pig]
2        c         [cat, dog]
3        d  [cat, horse, pig]
4        e     [chicken, pig]

df I want:

  molecule            species
2        c         [cat, dog]

I tried the following and it returned only the columns labels.

df[pd.DataFrame(df.species.tolist()).isin(selection).all(1)]

Upvotes: 0

Views: 1684

Answers (2)

Denis Alves
Denis Alves

Reputation: 56

This will find anything.

import pandas as pd
selection = ['cat', 'dog']
mols = pd.DataFrame({'molecule':['a','b','c','d','e'],'species':[['dog'],['horse','pig'],['cat','dog'],['cat','horse','pig'],['chicken','pig']]})
mols.loc[np.where(pd.Series([all(w in selection for w in mols.species.values[k]) for k in mols.index]).map({True:1,False:0}) == 1)[0]]

If you want to find any rows that have at least the elements in the list (and could have others as well), use:

mols.loc[np.where(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}) == 1)[0]]

This is an interesting application of matrices as selectors. Use the transposed mols to multiply the vector of zeroes and ones that points which rows in mols fit your criteria:

mols.to_numpy().T.dot(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}))

Another (more readable) solution would be to assign, to mols, a column where the condition is True, map it to 0 and 1 and query mols where that column is equal to 1.

Upvotes: 0

RichieV
RichieV

Reputation: 5183

One way to do it:

df['joined'] = df.species.str.join(sep=',')
selection = ['cat,dog']
filtered = df.loc[df.joined.isin(selection)]

This won't find cases with different sorting (i.e. 'dog,cat' or 'horse,cat,pig'), but if that is not an issue then it works fine.

Upvotes: 1

Related Questions