Reputation: 33
I have two DataFrames, df1 and df2.
df1:
categories ;
['hello','world']
['gogo','albert']
['dodo']
df2:
categories ;
['hello','world']
['albert']
['dodji']
I want to keep only the rows of df1 whose list has a non-empty intersection with df2. For this example, the result would be:
df_all:
categories ;
['hello','world']
['gogo','albert']
This is because ['hello','world'] in df1 and ['hello','world'] in df2 have a non-empty intersection, and so do ['gogo','albert'] and ['albert'], so those rows of df1 are kept.
Upvotes: 0
Views: 154
Reputation: 11657
Pandas isn't optimised for Series consisting of lists. I think the best solution is to use Python sets: check row by row that the intersection is non-empty, then use the result to mask df1:
import pandas as pd

# Set up data
df1 = pd.DataFrame({'categories': [['hello','world'],['gogo','albert'],['dodo']]})
df2 = pd.DataFrame({'categories': [['hello','world'],['albert'],['dodji']]})
# Solution: True where the two lists in the same row position share an element
mask = [len(set(a).intersection(b)) > 0
        for (a, b) in zip(df1.categories, df2.categories)]
df1.loc[mask]
Output:
categories
0 [hello, world]
1 [gogo, albert]
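Note that zip pairs rows by position, so this only works when df1 and df2 have the same number of rows in the same order. If the intent is instead to keep a df1 row when it shares an element with any row of df2, a possible variant is sketched below. This is an assumption about what the question might mean, not something it states, and the reordered df2 with an extra row is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'categories': [['hello', 'world'], ['gogo', 'albert'], ['dodo']]})
# Hypothetical df2: different order and length than df1
df2 = pd.DataFrame({'categories': [['albert'], ['dodji'], ['hello', 'world'], ['extra']]})

# Collect every category that appears anywhere in df2
df2_values = set(df2['categories'].explode())

# Keep a df1 row if it shares at least one category with that set
mask = df1['categories'].apply(lambda cats: not df2_values.isdisjoint(cats))
result = df1[mask]
```

Here result holds the first two rows of df1, the same rows as in the output above, even though the rows of df2 no longer line up with df1.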
Upvotes: 1