michel gold

Reputation: 33

Merge two DataFrames based on intersection

I have two dataframes, df1 and df2:

df1:

categories ; 
['hello','world']
['gogo','albert']
['dodo']

df2:

categories ; 
['hello','world']
['albert']
['dodji']

As a result, I want only the rows of df1 whose list intersects the corresponding list in df2: if the intersection is non-empty, keep that row of df1. For example, in our case we would have:

df_all:

categories ; 
['hello','world']
['gogo','albert']

This is because the intersection of ['hello','world'] from df1 and ['hello','world'] from df2 is non-empty, and the intersection of ['gogo','albert'] and ['albert'] is non-empty, so we keep those rows of df1.

Upvotes: 0

Views: 154

Answers (1)

Josh Friedlander

Reputation: 11657

Pandas isn't optimised for Series consisting of lists. I think the best solution is just to use Python sets, check that each row's intersection is non-empty, and use that to mask df1:

# Set up data
import pandas as pd

df1 = pd.DataFrame({'categories': [['hello','world'],['gogo','albert'],['dodo']]})
df2 = pd.DataFrame({'categories': [['hello','world'],['albert'],['dodji']]})

# Solution: True where the lists at the same row position share at least one element
mask = [len(set(a).intersection(b)) > 0
        for (a, b) in zip(df1.categories, df2.categories)]
df1.loc[mask]

Output:

    categories
0   [hello, world]
1   [gogo, albert]
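
Note that the mask above compares rows pairwise by position. If the intent is instead to keep a row of df1 when it shares an element with any row of df2 (an assumption, since the question doesn't say which is meant), a rough sketch could look like this. The names `df2_values`, `mask_any` and `df_all` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'categories': [['hello', 'world'], ['gogo', 'albert'], ['dodo']]})
df2 = pd.DataFrame({'categories': [['hello', 'world'], ['albert'], ['dodji']]})

# Assumption: match against ANY row of df2, not only the row at the same position.
# Collect every value that appears anywhere in df2 into one set.
df2_values = set().union(*df2['categories'])

# Keep a df1 row if it shares at least one value with that set.
mask_any = [not df2_values.isdisjoint(row) for row in df1['categories']]
df_all = df1.loc[mask_any]
```

On the sample data this keeps the same two rows as the positional version, but the two approaches would differ if the matching lists sat at different row positions in df1 and df2.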

Upvotes: 1
