TrimPeachu
TrimPeachu

Reputation: 13

Find difference between two dataframes

I have two dataframes old_df and new_df that I am reading from SQL Server database. They both contain data on various products and their prices. I need to compare these two dataframes and create table, that would look like this: (dummy data)

Seller Code Product Old_Price New_Price Status
MC oTMFtAS8w6 Golden Apple 3360 3200 Price change
MC kYp2jkorw6 washing machine 765 New Product
MC UGl93h4qGo outlet 1155 Product deleted

I have come up with the solution that is really ineffective due to the last iterrows function:

def compare(old_df,new_df,sellers_list):
df_changes = pd.DataFrame(columns = ['Seller' , 'Code', 'Product' , 'Old_Price', 'New_Price', 'Status'])

for seller in sellers_list:
    old_df[seller]['Concat'] = old_df[seller]['Code'] + old_df[seller]['Product']
    new_df[seller]['Concat'] = new_df[seller]['Code'] + new_df[seller]['Product']

    temp_df1 = old_df[seller].set_index('Concat')
    temp_df2 = new_df[seller].set_index('Concat')
    temp_df1 = temp_df1[['Seller','Product', 'Code' ,'Price']]
    temp_df2 = temp_df2[['Seller','Product', 'Code','Price']]
    
    df_concat = pd.concat([temp_df1,temp_df2], axis= 'columns', keys= ['Old', 'New']).drop_duplicates(keep= False)
    
    final_df = pd.DataFrame()
    final_df = final_df.append(df_concat[df_concat[('Old',    'Price')] != df_concat[('New', 'Price')]])

    for index, row in final_df.iterrows():
        if pd.isnull(row[('Old',    'Seller')]):
            df_changes = df_changes.append({'Seller' : row[('New',    'Seller')], 'Code' : row[('New',    'Code')], 'Product': row[('New',    'Product')] , 'Old_Price' : None, 'New_Price' : row[('New',    'Price')], 'Status': 'New Product'}, ignore_index= True)
        elif pd.isnull(row[('New',    'Seller')]):
            df_changes = df_changes.append({'Seller' : row[('Old',    'Seller')], 'Code' : row[('Old',    'Code')], 'Product': row[('Old',    'Product')] , 'Old_Price' : row[('Old',    'Price')], 'New_Price' : None , 'Status': 'Product discontinued'}, ignore_index= True)
        else:
            df_changes = df_changes.append({'Seller' : row[('New',    'Seller')], 'Code' : row[('New',    'Code')], 'Product': row[('New',    'Product')] , 'Old_Price' : row[('Old',    'Price')], 'New_Price' : row[('New',    'Price')], 'Status': 'Price change'}, ignore_index= True)
     

return df_changes

I am using concat as sometimes there is one Code for two Products (the difference is in the Product name). Could you help me find a more effective solution to my problem?

Upvotes: 1

Views: 223

Answers (1)

Corralien
Corralien

Reputation: 120559

This is the perfect case for merge and the indicator parameter:

Input data:

>>> df1  # old products
  Seller        Code       Product  Price
0     MC  oTMFtAS8w6  Golden Apple   3360
1     MC  UGl93h4qGo        outlet   1155

>>> df2  # new products
  Seller        Code          Product  Price
0     MC  oTMFtAS8w6     Golden Apple   3200
1     MC  kYp2jkorw6  washing machine    765
out = pd.merge(df1, df2, on='Product', how='outer',
               suffixes=('', '_new'), indicator=True)

# Remove products where price has unchanged
out = out[out['Price'] !=  out['Price_new']]

out['Status'] = out['_merge'].replace({'both': 'Price change',
                                       'left_only': 'Product deleted',
                                       'right_only': 'New Product'})

out.update(out[['Seller', 'Seller_new', 'Code', 'Code_new']].bfill(axis=1))

out = out[['Seller', 'Code', 'Product', 'Price', 'Price_new', 'Status']]

Output result

>>> out
  Seller        Code          Product   Price  Price_new           Status
0     MC  oTMFtAS8w6     Golden Apple  3360.0     3200.0     Price change
1     MC  UGl93h4qGo           outlet  1155.0        NaN  Product deleted
2     MC  kYp2jkorw6  washing machine     NaN      765.0      New Product

Upvotes: 1

Related Questions