Reputation: 1971
Set-up
I have 2 pandas dfs (df1 and df2) which contain some overlapping rows and some non-overlapping rows.
Both dfs have the columns order_id
and shop
.
Now, if a df1 row matches with a df2 row on the combination of order_id
and shop
, then nothing. But if a df1 row doesn't match with a df2 row on the combination of order_id
and shop
, then this df1 row should be added to df2.
Example
df2
is such that,
order_id shop
0 12345 'NL'
1 45678 'FR'
2 12345 'DE'
3 34567 'NL'
Now if df1
such that,
order_id shop
0 12345 'NL'
1 45678 'FR'
then nothing.
But if df1
such that,
order_id shop
0 12345 'NL'
1 99999 'FR'
2 12345 'UK'
then row 1 and 2 should be added to df2
even though shop
value for row 1 and order_id
for row 2 are already in df2
.
The resulting df2
should look like,
order_id shop
0 99999 'FR'
1 12345 'UK'
2 12345 'NL'
3 45678 'FR'
4 12345 'DE'
5 34567 'NL'
Note that the order_id
column is int and the shop
column is string.
So in graphical terms what I want to achieve looks like:
Code
I created a monstrous line which then didn't really work...
So far, I have,
result_df = df1[(~df1['order_id'].astype(str).isin(df2['order_id'].astype(str)))]
How do I solve this?
EXTRA
if df1
such that,
order_id shop
0 12345 'NL'
1 99999 'FR'
2 12345 'UK'
how do I compare df1
with df2
such that I get df3
which looks like,
order_id shop
0 99999 'FR'
1 12345 'UK'
Upvotes: 2
Views: 1917
Reputation: 946
Looks like you wish to create a set of all the possible combinations? If so, you can create the unique pairs using the following:
import pandas as pd
data1 = {'order_id': [12345, 45678, 78901, 12345, 12901, 12345], 'shop': ['NL', 'FR', 'AB', 'AB', 'NL', 'NL']}
df1 = pd.DataFrame(data=data1)
data2= {'order_id': [12345, 45678, 12345, 34567], 'shop': ['NL', 'FR', 'DE', 'NL']}
df2 = pd.DataFrame(data=data2)
df3 = df1
df3['Combi'] = df3['order_id'].astype('str') + df3['shop']
df3.drop_duplicates('Combi', inplace=True)
Upvotes: 0
Reputation: 862406
If all rows are unique use concat
with drop_duplicates
:
df = pd.concat([df2, df1], ignore_index=True).drop_duplicates()
print (df)
order_id shop
0 12345 'NL'
1 45678 'FR'
2 12345 'DE'
3 34567 'NL'
5 99999 'FR'
6 12345 'UK'
If not unique filter not equal rows by merge
with indicator=True
and then concat
to df2
:
df3 = df1.merge(df2, how='left', indicator=True).query('_merge == "left_only"')[df1.columns]
df = pd.concat([df2, df3], ignore_index=True)
Upvotes: 3