Reputation: 843
I have two dataframes. Dataframe A has a column named items whose values are lists of ids.
Dataframe B has a column named id whose values are int ids.
Dataframe A:
date | items
2019-06-05 | [121, 123, 124]
2019-06-06 | [109, 125]
2019-06-07 | [108, 126]
Dataframe B:
name | id
item1 | 121
item2 | 122
item3 | 123
item4 | 124
item5 | 125
item6 | 126
I want to filter Dataframe A and keep only the rows in which all values of items exist in the id column of Dataframe B.
Based on the above example, the result should be:
Dataframe C:
date | items
2019-06-05 | [121, 123, 124]
(since Dataframe B doesn't have rows with id==108 and id==109)
If items were an int column, I could use:
dataframe_a[dataframe_a['items'].isin(dataframe_b.id)]
How can I achieve this for list columns?
Upvotes: 3
Views: 4335
Reputation: 49798
Valentino beat me to it, so the idea is the same:
dataframe_a[dataframe_a['items'].apply(lambda lst: all(x in dataframe_b.id.values for x in lst))]
And here are a couple more words on your current approach:
pd.Series.isin checks whether each element (in your case each list) exists as a whole in the other sequence. You treat your lists as unordered collections of ids, but for a series of tuples order would matter, and checking for existence as a whole is the correct/expected behavior.
Also be careful with the plain in operator. isin(dataframe_b.id) compares against the values of dataframe_b.id, but x in dataframe_b.id does not. A pd.Series is like a dictionary, and the in/contains check looks at the loc/index (or keys, in dictionary terminology) rather than the values themselves. If your loc/index contains ints that happen to overlap with your ids, x in dataframe_b.id could return True unexpectedly:
In [17]: dataframe_b
Out[17]:
id
0 121
1 122
2 123
3 124
In [18]: 121 in dataframe_b.id
Out[18]: False
In [19]: 121 in dataframe_b.id.index
Out[19]: False
In [20]: 121 in dataframe_b.id.values
Out[20]: True
In [21]: 1 in dataframe_b.id
Out[21]: True
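For anyone who wants to reproduce the difference outside an interactive session, here is a minimal sketch. The variable names (ids, in_by_label and so on) are made up for illustration, and the frame uses the default 0..3 index as in the session above.

```python
import pandas as pd

# Same ids as in the session above, with the default integer index 0..3.
dataframe_b = pd.DataFrame({"id": [121, 122, 123, 124]})
ids = dataframe_b["id"]

# `in` on a Series looks at the index labels, not at the values.
in_by_label = 1 in ids      # 1 is an index label
in_by_value = 121 in ids    # 121 is a value, not an index label

# Checking against .values, or using isin, looks at the values.
in_values = 121 in ids.values
isin_mask = pd.Series([121, 108]).isin(ids).tolist()

print(in_by_label, in_by_value, in_values, isin_mask)
```

So a membership test written with in should go through .values (or a set built from the column), while isin can take the Series directly.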
Upvotes: 2
Reputation: 323226
We can use issubset (here df is Dataframe A and dfb is Dataframe B):
l = [set(x).issubset(dfb.id.tolist()) for x in df['items']]
l
Out[64]: [True, False, False]
Then filter with the boolean list:
df = df[l]
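As a self-contained sketch of the same set-based idea, the version below rebuilds the example frames from the question and builds the set of ids once, instead of converting the id column for every row. The names valid_ids, mask and result are only illustrative.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "date": ["2019-06-05", "2019-06-06", "2019-06-07"],
    "items": [[121, 123, 124], [109, 125], [108, 126]],
})
dfb = pd.DataFrame({
    "name": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "id": [121, 122, 123, 124, 125, 126],
})

# Build the lookup set once; set membership is a constant-time check.
valid_ids = set(dfb["id"])

# A row is kept when every id in its list is contained in valid_ids.
mask = df["items"].apply(valid_ids.issuperset)
result = df[mask]

print(result)
```

Note that a row with an empty list would be kept by this check, since the empty set is a subset of every set. Whether that is desirable depends on your data.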
Upvotes: 5
Reputation: 7361
You can define a custom function that checks whether all the elements of the list are in Dataframe B and use it with apply.
Here df1 is your Dataframe A and df2 is your Dataframe B:
sel = df1.apply(lambda x : all([i in df2['id'].unique() for i in x['items']]), axis=1)
finaldf = df1.loc[sel]
finaldf is:
date items
0 2019-06-05 [121, 123, 124]
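If you would rather avoid a Python-level function per row, a different technique (not the apply approach above) is to explode the lists and reduce back per row. This is only a sketch: it assumes pandas 0.25 or newer for Series.explode, a unique index on df1, and the names exploded and keep are made up for illustration.

```python
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({
    "date": ["2019-06-05", "2019-06-06", "2019-06-07"],
    "items": [[121, 123, 124], [109, 125], [108, 126]],
})
df2 = pd.DataFrame({
    "name": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "id": [121, 122, 123, 124, 125, 126],
})

# One row per list element; the original row label is repeated in the index.
exploded = df1["items"].explode()

# Test each element against the ids, then require all elements of a row to match.
keep = exploded.isin(df2["id"]).groupby(level=0).all()

finaldf = df1[keep]
print(finaldf)
```

With the example data this keeps only the 2019-06-05 row, the same result as above.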
Upvotes: 3