Reputation: 13
I have two dataframes old_df and new_df that I am reading from SQL Server database. They both contain data on various products and their prices. I need to compare these two dataframes and create table, that would look like this: (dummy data)
Seller | Code | Product | Old_Price | New_Price | Status |
---|---|---|---|---|---|
MC | oTMFtAS8w6 | Golden Apple | 3360 | 3200 | Price change |
MC | kYp2jkorw6 | washing machine | 765 | New Product | |
MC | UGl93h4qGo | outlet | 1155 | Product deleted |
I have come up with the solution that is really ineffective due to the last iterrows function:
def compare(old_df,new_df,sellers_list):
df_changes = pd.DataFrame(columns = ['Seller' , 'Code', 'Product' , 'Old_Price', 'New_Price', 'Status'])
for seller in sellers_list:
old_df[seller]['Concat'] = old_df[seller]['Code'] + old_df[seller]['Product']
new_df[seller]['Concat'] = new_df[seller]['Code'] + new_df[seller]['Product']
temp_df1 = old_df[seller].set_index('Concat')
temp_df2 = new_df[seller].set_index('Concat')
temp_df1 = temp_df1[['Seller','Product', 'Code' ,'Price']]
temp_df2 = temp_df2[['Seller','Product', 'Code','Price']]
df_concat = pd.concat([temp_df1,temp_df2], axis= 'columns', keys= ['Old', 'New']).drop_duplicates(keep= False)
final_df = pd.DataFrame()
final_df = final_df.append(df_concat[df_concat[('Old', 'Price')] != df_concat[('New', 'Price')]])
for index, row in final_df.iterrows():
if pd.isnull(row[('Old', 'Seller')]):
df_changes = df_changes.append({'Seller' : row[('New', 'Seller')], 'Code' : row[('New', 'Code')], 'Product': row[('New', 'Product')] , 'Old_Price' : None, 'New_Price' : row[('New', 'Price')], 'Status': 'New Product'}, ignore_index= True)
elif pd.isnull(row[('New', 'Seller')]):
df_changes = df_changes.append({'Seller' : row[('Old', 'Seller')], 'Code' : row[('Old', 'Code')], 'Product': row[('Old', 'Product')] , 'Old_Price' : row[('Old', 'Price')], 'New_Price' : None , 'Status': 'Product discontinued'}, ignore_index= True)
else:
df_changes = df_changes.append({'Seller' : row[('New', 'Seller')], 'Code' : row[('New', 'Code')], 'Product': row[('New', 'Product')] , 'Old_Price' : row[('Old', 'Price')], 'New_Price' : row[('New', 'Price')], 'Status': 'Price change'}, ignore_index= True)
return df_changes
I am using concat as sometimes there is one Code for two Products (the difference is in the Product name). Could you help me find a more effective solution to my problem?
Upvotes: 1
Views: 223
Reputation: 120559
This is the perfect case for merge
and the indicator
parameter:
Input data:
>>> df1 # old products
Seller Code Product Price
0 MC oTMFtAS8w6 Golden Apple 3360
1 MC UGl93h4qGo outlet 1155
>>> df2 # new products
Seller Code Product Price
0 MC oTMFtAS8w6 Golden Apple 3200
1 MC kYp2jkorw6 washing machine 765
out = pd.merge(df1, df2, on='Product', how='outer',
suffixes=('', '_new'), indicator=True)
# Remove products where price has unchanged
out = out[out['Price'] != out['Price_new']]
out['Status'] = out['_merge'].replace({'both': 'Price change',
'left_only': 'Product deleted',
'right_only': 'New Product'})
out.update(out[['Seller', 'Seller_new', 'Code', 'Code_new']].bfill(axis=1))
out = out[['Seller', 'Code', 'Product', 'Price', 'Price_new', 'Status']]
Output result
>>> out
Seller Code Product Price Price_new Status
0 MC oTMFtAS8w6 Golden Apple 3360.0 3200.0 Price change
1 MC UGl93h4qGo outlet 1155.0 NaN Product deleted
2 MC kYp2jkorw6 washing machine NaN 765.0 New Product
Upvotes: 1