KirklandShawty
KirklandShawty

Reputation: 263

Comparing differences in a column in two dataframes in Pandas

I have two datasets and I'm trying to compare only a single column based on unique ID. I want to track and flag any value changes in that column, and output those changes into another DF.

DF1:

ID      Status
1234    Cleared
4321    Pending
5678    Distributed
8765    Validating
2468    Blocked
8642    Pending
1357    Pending
7531    Distributed

DF2:

ID      Status
1234    Distributed
4321    Pending
5678    Pending
8765    Cleared
2468    Blocked
8642    Blocked
1357    Cleared
7531    Blocked

Output:

ID      Status       Status
1234    Cleared      Distributed
5678    Distributed  Pending
8765    Validating   Cleared
8642    Pending      Blocked
1357    Pending      Cleared
7531    Distributed  Blocked

Lastly, I am also trying to see any changes in another column based on changes in the status column. This column includes a list of countries using the standard ISO Alpha-2 country codes. Was thinking of doing a simple character count here but that doesn't make sense because if US was removed and replaced with DE, then character count would stay the same.

My code for all of this (repurposed from other questions here) is as follows, but I feel like there's probably a way more efficient way of doing this...

for index, compare_row in compare_df.iterrows():
row_df1 = df1.loc[df1['ID'] == compare_row['ID']]    
row_df2 = df2.loc[df2['ID'] == compare_row['ID']]    
if (row_df1.iloc[0]['Status'] != row_df2.iloc[0]['Status']):
    print "here 1"
    output_df.append(row_df1)
    output_df.append(row_df2)
elif (row_df1.iloc[0]['Status'] in ['Cleared', 'Distributed']) & (row_df1.iloc[0]['Territory'] != row_df2.iloc[0]['Territory']):
    print "here 2"
    output_df.append(row_df1)
    output_df.append(row_df2)

Upvotes: 1

Views: 79

Answers (3)

Scott Boston
Scott Boston

Reputation: 153460

Use .query for improved readability.

DF1.merge(DF2, on = 'ID').query('Status_x != Status_y')

Output:

     ID     Status_x     Status_y
0  1234      Cleared  Distributed
2  5678  Distributed      Pending
3  8765   Validating      Cleared
5  8642      Pending      Blocked
6  1357      Pending      Cleared
7  7531  Distributed      Blocked

Upvotes: 0

ipramusinto
ipramusinto

Reputation: 2648

It might not be the most effective way, but at least reach the goal. :)

df3 = df1.copy()
df3['Status_df2'] = df2.Status.copy()
df3 = df3.loc[df3.Status != df3.Status_df2]

Upvotes: 0

Vishnu Kunchur
Vishnu Kunchur

Reputation: 1726

Use merge:

df3 = df1.merge(df2, left_index = True, right_index = True)
mask = df3['Status_x'] == df3['Status_y']
df3 = df3[~mask]

Upvotes: 2

Related Questions