LucSpan
LucSpan

Reputation: 1971

Check if two column values found in other pandas dataframe

Set-up

I have 2 pandas dfs (df1 and df2) which contain some overlapping rows and some non-overlapping rows.

Both dfs have the columns order_id and shop.

Now, if a df1 row matches with a df2 row on the combination of order_id and shop, then nothing. But if a df1 row doesn't match with a df2 row on the combination of order_id and shop, then this df1 row should be added to df2.


Example

df2 is such that,

    order_id    shop
0     12345     'NL'
1     45678     'FR'
2     12345     'DE'
3     34567     'NL'

Now if df1 such that,

    order_id    shop
0     12345     'NL'
1     45678     'FR'

then nothing.


But if df1 such that,

        order_id    shop
0       12345       'NL'
1       99999       'FR'
2       12345       'UK'

then row 1 and 2 should be added to df2 even though shop value for row 1 and order_id for row 2 are already in df2.

The resulting df2 should look like,

    order_id    shop
0     99999     'FR'
1     12345     'UK'        
2     12345     'NL'
3     45678     'FR'
4     12345     'DE'
5     34567     'NL' 

Note that the order_id column is int and the shop column is string.

So in graphical terms what I want to achieve looks like:

enter image description here

Code

I created a monstrous line which then didn't really work...

So far, I have,

result_df = df1[(~df1['order_id'].astype(str).isin(df2['order_id'].astype(str)))]

How do I solve this?


EXTRA

if df1 such that,

        order_id    shop
0       12345       'NL'
1       99999       'FR'
2       12345       'UK'

how do I compare df1 with df2 such that I get df3 which looks like,

        order_id    shop
0       99999       'FR'
1       12345       'UK'

Upvotes: 2

Views: 1917

Answers (2)

cyneo
cyneo

Reputation: 946

Looks like you wish to create a set of all the possible combinations? If so, you can create the unique pairs using the following:

import pandas as pd

data1 = {'order_id': [12345, 45678, 78901, 12345, 12901, 12345], 'shop': ['NL', 'FR', 'AB', 'AB', 'NL', 'NL']}
df1 = pd.DataFrame(data=data1)

data2= {'order_id': [12345, 45678, 12345, 34567], 'shop': ['NL', 'FR', 'DE', 'NL']}
df2 = pd.DataFrame(data=data2)

df3 = df1
df3['Combi'] = df3['order_id'].astype('str') + df3['shop']
df3.drop_duplicates('Combi', inplace=True)

df1 df3

Upvotes: 0

jezrael
jezrael

Reputation: 862406

If all rows are unique use concat with drop_duplicates:

df = pd.concat([df2, df1], ignore_index=True).drop_duplicates()
print (df)
   order_id  shop
0     12345  'NL'
1     45678  'FR'
2     12345  'DE'
3     34567  'NL'
5     99999  'FR'
6     12345  'UK'

If not unique filter not equal rows by merge with indicator=True and then concat to df2:

df3 = df1.merge(df2, how='left', indicator=True).query('_merge == "left_only"')[df1.columns]
df = pd.concat([df2, df3], ignore_index=True)

Upvotes: 3

Related Questions