CharlieR
CharlieR

Reputation: 65

How To Compare Two Pandas DataFrames and Show Differences In DataFrame 2

I currently have two pandas dataframes:

sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215}]
sales2 = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]
test_1 = pd.DataFrame(sales)
test_2 = pd.DataFrame(sales2)

What I want to achieve is to show only the differences that are in 'test_2' and not in 'test_1'.

The code I currently have concatenates the two dataframes and shows me the total difference across both dataframes however all I want to see if the differences in 'test_2' to 'test_1' and not the reverse:

def compare_dataframes(df1, df2):

    print 'Comparing dataframes...'
    df = pd.concat([df1, df2])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    compared_data = df.reindex(idx)
    if len(compared_data) > 1:
        print 'No new sales on site!'
    else:
        print 'New sales on site!'
        print(compared_data)

How could I adapt my current function to work like this?

Upvotes: 2

Views: 527

Answers (2)

Oncess
Oncess

Reputation: 16

import pandas as pd
import numpy as np
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215}]
sales2 = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]
test_1 = pd.DataFrame(sales)
test_2 = pd.DataFrame(sales2)
test_3 = test_1.append(test_2).drop_duplicates(keep=False)
print (test_3)

it prints the different rows

Upvotes: 0

jezrael
jezrael

Reputation: 862601

Use merge with outer join and indicator parameter:

df = test_1.merge(test_2, how='outer', indicator=True)
print (df)
   Feb  Jan  Mar    account      _merge
0  200  150  140  Jones LLC        both
1  210  200  215   Alpha Co        both
2   90   50   95   Blue Inc  right_only

And then filter only right_only rows by boolean indexing:

only2 = df[df['_merge'] == 'right_only']
print (only2)
   Feb  Jan  Mar   account      _merge
2   90   50   95  Blue Inc  right_only

Thanks @Jon Clements for one line solution with callback:

only2 = test_1.merge(test_2, how='outer', indicator=True)[lambda r: r._merge == 'right_only']
print (only2)
   Feb  Jan  Mar   account      _merge
2   90   50   95  Blue Inc  right_only

Or use query:

only2 = test_1.merge(test_2, how='outer', indicator=True).query("_merge == 'right_only'")

Upvotes: 5

Related Questions