sumathi
sumathi

Reputation: 55

How to iterate over rows between 2 different dataframes and capture the entire row

Is there way easier way to capture the difference in data between two dataframe. I am trying to compare 2 dataframe and find the missing. I found this attached link but it's not satisfying my requirement. In that post they comparing using Id but I wanted to compare the entire rows of 2 dataframes and find the mismatch row whether it is from A or B.

Iterate over different dataframe

DF 1:

country    from         to       flag
('GB',  'GB390065', 'GB66903',  'Y')
('FR',  'FR418458', 'FR765617', 'Y')
('FR',  'FR629810', 'FR855277', 'Y')
('FR',  'FR494010', 'FR332891', 'Y')
('FR',  'FR740500', 'FR907878', 'Y')
('FI',  'NB444135', 'NB234471', 'Y')

DF 2:

country    from         to       flag
('GB',  'GB390065', 'GB66903',  'Y')
('FR',  'FR418458', 'FR7656',   'Y')
('FR',  'FR629810', 'FR855277', 'Y')
('FR',  'FR4910',   'FR33891',  'Y')
('FR',  'FR740500', 'FR907878', 'Y')
('FI',  'NB444135', 'NB234471', 'Y')

My expected output:

country    from         to       flag   Available
('FR',  'FR418458', 'FR7656',   'Y'       df2)
('FR',  'FR4910',   'FR33891',  'Y'       df2)

The challenge I face here is from column and to column are same data only so when I tried using merge function, it is taking any one of the column and giving the result which is not meeting the requirement.

The output I got is

               country_code from_cust_id  ... to_cust_guid Merge_status
       322443         DE       DE180556  ...          NaN    left_only
       322444         DE        DE22191  ...          NaN    left_only
       322445         DE      DE2625168  ...          NaN    left_only
       322446         DE        DE17705  ...          NaN    left_only
       322447         DE      DE2556758  ...          NaN    left_only

      [5 rows x 7 columns]
              country_code from_cust_id  ... to_cust_guid Merge_status
       0                DE          NaN  ...     DE485137   right_only
       1                DE          NaN  ...     DE467209   right_only
       2                DE          NaN  ...     DE651068   right_only
       3                DE          NaN  ...     DE459729   right_only
       4                DE          NaN  ...     DE448683   right_only

Upvotes: 1

Views: 140

Answers (3)

Arun V
Arun V

Reputation: 11

Try this

df1 = pd.DataFrame(Mylist1,columns=['country','from','to','flag'])
df2 = pd.DataFrame(Mylist2,columns=['country','from','to','flag'])
# print(df2)
indexes = (df1 != df2).any(axis=1)
indexes1 = (df2 != df1).any(axis=1)
df3 = df2.loc[indexes]
df4 = df1.loc[indexes1]
pd.options.mode.chained_assignment = None
df3['Available'] = 'df1'
df4['Available'] = 'df2'
# print(df3)
# print(df4)
df_row_reindex = pd.concat([df3, df4], ignore_index=False)
print(df_row_reindex)


country from    to         flag Available
FR  FR418458    FR7656     Y    df1
FR  FR4910      FR33891    Y    df1
FR  FR418458    FR765617   Y    df2
FR  FR494010    FR332891   Y    df2

Upvotes: 1

YOLO
YOLO

Reputation: 21709

Here's a way you can do by converting dataframe to a set of tuples:

a1 = set(df1.apply(tuple,1).values)
a2 = set(df2.apply(tuple,1).values)

print(a1)

{('FI', 'NB444135', 'NB234471', 'Y'),
 ('FR', 'FR418458', 'FR765617', 'Y'),
 ('FR', 'FR494010', 'FR332891', 'Y'),
 ('FR', 'FR629810', 'FR855277', 'Y'),
 ('FR', 'FR740500', 'FR907878', 'Y'),
 ('GB', 'GB390065', 'GB66903', 'Y')}


print(a2)

{('FI', 'NB444135', 'NB234471', 'Y'),
 ('FR', 'FR418458', 'FR7656', 'Y'),
 ('FR', 'FR4910', 'FR33891', 'Y'),
 ('FR', 'FR629810', 'FR855277', 'Y'),
 ('FR', 'FR740500', 'FR907878', 'Y'),
 ('GB', 'GB390065', 'GB66903', 'Y')}

# now do set difference (in df2 not in df1)
diffs = pd.DataFrame(np.array([x for x in a2 - a1]))
diffs['Available'] = 'df2'

print(diffs)

    0         1        2  3 Available
0  FR  FR418458   FR7656  Y       df2
1  FR    FR4910  FR33891  Y       df2

Upvotes: 2

Vishnudev Krishnadas
Vishnudev Krishnadas

Reputation: 10960

Try

df1['Available'] = 'df1'
df2['Available'] = 'df2'

available = pd.concat([df1, df2]).drop_duplicates(subset=['from', 'to'], keep=False)

Upvotes: 2

Related Questions