Reputation: 1134
I am trying to figure out how to display the differences between 2 Pandas DataFrames. I am nearly there but can't seem to figure out how to display additional data for rows that contain a difference.
Here is what I have so far:
Compare DataFrame A to DataFrame B:
DataFrame A:
Date ID_1 ID_2 Value
1-Jan 1 1 5
2-Jan 1 2 6
3-Jan 1 3 4
4-Jan 1 4 2
5-Jan 1 5 8
DataFrame B:
Date ID_1 ID_2 Value
1-Jan 1 1 5
2-Jan 1 2 6
3-Jan 1 3 4
4-Jan 1 4 2
5-Jan 1 5 55
Current Output:
Date Column From To
5-Jan Value 8 55
Desired Output:
Date ID_1 ID_2 From To
5-Jan 1 5 8 55
Current Code:
#stack column(s) where dataframes are not equal
ne_stacked = (df1 != df2).stack()
#create new dataframe from ne_stacked
changed = ne_stacked[ne_stacked]
#change column names
changed.index.names = ['Date', 'Column']
#create array where dataframes are not equal
diff_loc = np.where(df1 != df2)
#create 'from' column
changed_from = df1.values[diff_loc]
#create 'to' column
changed_to = df2.values[diff_loc]
#create a summary dataframe
final = pd.DataFrame({'From': changed_from, 'To': changed_to}, index=changed.index)
print final
Upvotes: 0
Views: 210
Reputation: 54380
use merge
:
In [29]:
print df_a
Date ID_1 ID_2 Value
0 1-Jan 1 1 5
1 2-Jan 1 2 6
2 3-Jan 1 3 4
3 4-Jan 1 4 2
4 5-Jan 1 5 8
In [30]:
print df_b
Date ID_1 ID_2 Value
0 1-Jan 1 1 5
1 2-Jan 1 2 6
2 3-Jan 1 3 4
3 4-Jan 1 4 2
4 5-Jan 1 5 55
In [31]:
df_c = pd.merge(df_a, df_b,
how='outer',
on=['Date', 'ID_1', 'ID_2'])
df_c.columns = ['Date', 'ID_1', 'ID_2', 'From', 'To']
df_c = df_c[df_c.From!=df_c.To]
print df_c
Date ID_1 ID_2 From To
4 5-Jan 1 5 8 55
Upvotes: 2
Reputation: 10302
Try this:
dfm = df1.merge(df2, on=['Date', 'ID_1', 'ID_2']).rename(columns={'Value_x':'From', 'Value_y':'To'})
print dfm[dfm.From != dfm.To]
Date ID_1 ID_2 From To
4 5-Jan 1 5 8 55
Upvotes: 1