Saeed Esmaili

Reputation: 843

using isin() for a column that has list values

I have two dataframes. Dataframe A has a column (named items) whose values are lists of ids. Dataframe B has an int column of ids (named id).

Dataframe A:

date       |    items
2019-06-05 | [121, 123, 124]
2019-06-06 | [109, 125]
2019-06-07 | [108, 126]

Dataframe B:

name  | id
item1 | 121
item2 | 122
item3 | 123
item4 | 124
item5 | 125
item6 | 126

I want to filter Dataframe A and keep only the rows in which every value of items exists in the id column of Dataframe B.

Based on the above example, the result should be:

Dataframe C:

date       |    items
2019-06-05 | [121, 123, 124]

(since Dataframe B doesn't have rows with id==108 and id==109)

If items were an int column, I could use:

dataframe_a[dataframe_a.items.isin(dataframe_b.id)]

How can I achieve this with a column of lists?

Upvotes: 3

Views: 4335

Answers (3)

Garrett

Reputation: 49798

Valentino beat me to it, so the idea is the same:

dataframe_a[dataframe_a['items'].apply(lambda lst: all(x in dataframe_b.id.values for x in lst))]
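For anyone who wants to run it, here is a minimal self-contained sketch of the same one-liner. The frames are rebuilt by hand from the tables in the question; the intermediate names ids, mask and dataframe_c are only there for illustration.

```python
import pandas as pd

# Sample data copied from the question.
dataframe_a = pd.DataFrame({
    "date": ["2019-06-05", "2019-06-06", "2019-06-07"],
    "items": [[121, 123, 124], [109, 125], [108, 126]],
})
dataframe_b = pd.DataFrame({
    "name": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "id": [121, 122, 123, 124, 125, 126],
})

# Pull the ids out once as a NumPy array so `in` tests the values.
ids = dataframe_b["id"].values

# True for a row only if every element of its list is a known id.
mask = dataframe_a["items"].apply(lambda lst: all(x in ids for x in lst))

dataframe_c = dataframe_a[mask]
print(dataframe_c)
```

Only the 2019-06-05 row survives, since 109 and 108 are missing from Dataframe B.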

And here are a couple more words on your current approach:

  • pd.Series.isin checks whether each element of the series (in your case, each list) exists as a whole in the other sequence. You treat your lists as unordered collections, but for a series of tuples, for example, order would matter, so checking for existence as a whole is the correct and expected behavior.
  • Another pitfall is testing membership with the in operator directly on the column: x in dataframe_b.id is the same as x in dataframe_b.id.index, not a check against the ids. A pd.Series behaves like a dictionary here, so in checks the loc/index (the keys, in dictionary terminology) rather than the values themselves. isin(dataframe_b.id) itself compares against the values, but if you write x in dataframe_b.id and your loc/index contains ints that happen to overlap with your ids, it can return True unexpectedly. That is why the snippet above uses dataframe_b.id.values:
In [17]: dataframe_b
Out[17]:
    id
0  121
1  122
2  123
3  124

In [18]: 121 in dataframe_b.id
Out[18]: False

In [19]: 121 in dataframe_b.id.index
Out[19]: False

In [20]: 121 in dataframe_b.id.values
Out[20]: True

In [21]: 1 in dataframe_b.id
Out[21]: True
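The same contrast as a short script, with isin added for comparison. This is only a sketch on the four-row frame shown in the session above; the variable names are made up for the example.

```python
import pandas as pd

dataframe_b = pd.DataFrame({"id": [121, 122, 123, 124]})
ids = dataframe_b["id"]

# `in` on a Series looks at the index labels (0, 1, 2, 3 here).
in_series = 121 in ids          # 121 is not an index label
label_hit = 1 in ids            # 1 is an index label, not an id

# `in` on the underlying array looks at the stored values.
in_values = 121 in ids.values

# isin compares element-wise against the values of the Series passed in.
isin_result = pd.Series([121, 1]).isin(ids).tolist()

print(in_series, label_hit, in_values, isin_result)
```

So a scalar membership test needs .values (or a set built from the column), while isin can take the column as it is.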

Upvotes: 2

BENY

Reputation: 323226

We can use issubset (here df is Dataframe A and dfb is Dataframe B):

l = [set(x).issubset(dfb.id.tolist()) for x in df['items']]
Out[64]: [True, False, False]

Then filter with that boolean list:

df = df[l]
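A runnable sketch of the issubset idea, assuming df and dfb hold the sample data from the question. As a small variation, the ids are turned into a set once (valid_ids is a name picked for this example) instead of calling tolist() for every row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": ["2019-06-05", "2019-06-06", "2019-06-07"],
    "items": [[121, 123, 124], [109, 125], [108, 126]],
})
dfb = pd.DataFrame({
    "name": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "id": [121, 122, 123, 124, 125, 126],
})

# Build the lookup set once, outside the loop.
valid_ids = set(dfb["id"].tolist())

# One boolean per row: is the row's list a subset of the known ids?
mask = [set(items).issubset(valid_ids) for items in df["items"]]

# A plain list of booleans works as a row mask.
result = df[mask]
print(result)
```

Assigning to result rather than back to df keeps the original frame around for comparison.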

Upvotes: 5

Valentino

Reputation: 7361

You can define a custom function that checks whether all the elements of the list are in Dataframe B, and use it with apply.

Here df1 is your Dataframe A and df2 your Dataframe B:

sel = df1.apply(lambda x : all([i in df2['id'].unique() for i in x['items']]), axis=1)
finaldf = df1.loc[sel]

finaldf is:

         date            items
0  2019-06-05  [121, 123, 124]
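If you would rather stay with isin, a different technique from the apply above is to flatten the lists first. The sketch below assumes pandas 0.25 or newer (for Series.explode) and the sample data from the question; exploded is a name chosen just for this example.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "date": ["2019-06-05", "2019-06-06", "2019-06-07"],
    "items": [[121, 123, 124], [109, 125], [108, 126]],
})
df2 = pd.DataFrame({
    "name": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "id": [121, 122, 123, 124, 125, 126],
})

# One row per list element; each element keeps its original row label.
exploded = df1["items"].explode()

# Element-wise membership test, then require every element of a row to match.
sel = exploded.isin(df2["id"]).groupby(level=0).all()

finaldf = df1.loc[sel]
print(finaldf)
```

One difference to be aware of: a row with an empty list explodes to NaN and is dropped here, whereas all([]) is True, so the apply version keeps it.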

Upvotes: 3
