Reputation: 55
Is there way easier way to capture the difference in data between two dataframe. I am trying to compare 2 dataframe and find the missing. I found this attached link but it's not satisfying my requirement. In that post they comparing using Id but I wanted to compare the entire rows of 2 dataframes and find the mismatch row whether it is from A or B.
Iterate over different dataframe
DF 1:
country from to flag
('GB', 'GB390065', 'GB66903', 'Y')
('FR', 'FR418458', 'FR765617', 'Y')
('FR', 'FR629810', 'FR855277', 'Y')
('FR', 'FR494010', 'FR332891', 'Y')
('FR', 'FR740500', 'FR907878', 'Y')
('FI', 'NB444135', 'NB234471', 'Y')
DF 2:
country from to flag
('GB', 'GB390065', 'GB66903', 'Y')
('FR', 'FR418458', 'FR7656', 'Y')
('FR', 'FR629810', 'FR855277', 'Y')
('FR', 'FR4910', 'FR33891', 'Y')
('FR', 'FR740500', 'FR907878', 'Y')
('FI', 'NB444135', 'NB234471', 'Y')
My expected output:
country from to flag Available
('FR', 'FR418458', 'FR7656', 'Y' df2)
('FR', 'FR4910', 'FR33891', 'Y' df2)
The challenge I face here is from column
and to column
are same data only so when I tried using merge function
, it is taking any one of the column and giving the result which is not meeting the requirement.
The output I got is
country_code from_cust_id ... to_cust_guid Merge_status
322443 DE DE180556 ... NaN left_only
322444 DE DE22191 ... NaN left_only
322445 DE DE2625168 ... NaN left_only
322446 DE DE17705 ... NaN left_only
322447 DE DE2556758 ... NaN left_only
[5 rows x 7 columns]
country_code from_cust_id ... to_cust_guid Merge_status
0 DE NaN ... DE485137 right_only
1 DE NaN ... DE467209 right_only
2 DE NaN ... DE651068 right_only
3 DE NaN ... DE459729 right_only
4 DE NaN ... DE448683 right_only
Upvotes: 1
Views: 140
Reputation: 11
Try this
df1 = pd.DataFrame(Mylist1,columns=['country','from','to','flag'])
df2 = pd.DataFrame(Mylist2,columns=['country','from','to','flag'])
# print(df2)
indexes = (df1 != df2).any(axis=1)
indexes1 = (df2 != df1).any(axis=1)
df3 = df2.loc[indexes]
df4 = df1.loc[indexes1]
pd.options.mode.chained_assignment = None
df3['Available'] = 'df1'
df4['Available'] = 'df2'
# print(df3)
# print(df4)
df_row_reindex = pd.concat([df3, df4], ignore_index=False)
print(df_row_reindex)
country from to flag Available
FR FR418458 FR7656 Y df1
FR FR4910 FR33891 Y df1
FR FR418458 FR765617 Y df2
FR FR494010 FR332891 Y df2
Upvotes: 1
Reputation: 21709
Here's a way you can do by converting dataframe to a set of tuples:
a1 = set(df1.apply(tuple,1).values)
a2 = set(df2.apply(tuple,1).values)
print(a1)
{('FI', 'NB444135', 'NB234471', 'Y'),
('FR', 'FR418458', 'FR765617', 'Y'),
('FR', 'FR494010', 'FR332891', 'Y'),
('FR', 'FR629810', 'FR855277', 'Y'),
('FR', 'FR740500', 'FR907878', 'Y'),
('GB', 'GB390065', 'GB66903', 'Y')}
print(a2)
{('FI', 'NB444135', 'NB234471', 'Y'),
('FR', 'FR418458', 'FR7656', 'Y'),
('FR', 'FR4910', 'FR33891', 'Y'),
('FR', 'FR629810', 'FR855277', 'Y'),
('FR', 'FR740500', 'FR907878', 'Y'),
('GB', 'GB390065', 'GB66903', 'Y')}
# now do set difference (in df2 not in df1)
diffs = pd.DataFrame(np.array([x for x in a2 - a1]))
diffs['Available'] = 'df2'
print(diffs)
0 1 2 3 Available
0 FR FR418458 FR7656 Y df2
1 FR FR4910 FR33891 Y df2
Upvotes: 2
Reputation: 10960
Try
df1['Available'] = 'df1'
df2['Available'] = 'df2'
available = pd.concat([df1, df2]).drop_duplicates(subset=['from', 'to'], keep=False)
Upvotes: 2