Append / insert / concat rows to dataframe only if record is not already present

Question

I have two dataframes of customer information:

import pandas as pd

df1 = pd.DataFrame({'firstname':['jack','john','donald'],
                    'lastname':['ryan','obrien','trump'],
                    'email':['mymail@gmail.com','hismail@gmail.com','email@website.com'],
                    'bank_account':['abcd123','jhkf123','kdlk123']})

print(df1)

  firstname lastname              email bank_account
0      jack     ryan   mymail@gmail.com      abcd123
1      john   obrien  hismail@gmail.com      jhkf123
2    donald    trump  email@website.com      kdlk123


df2 = pd.DataFrame({'firstname':['jack','patrick','barak'],
                    'lastname':['ryan','murphy','obama'],
                    'email':['mymail@gmail.com','some@email.com','other@email.com'],
                    'bank_account':[float('nan')]*3})  # pd.np is not available in recent pandas

print(df2)


  firstname lastname             email  bank_account
0      jack     ryan  mymail@gmail.com           NaN
1   patrick   murphy    some@email.com           NaN
2     barak    obama   other@email.com           NaN

I want to insert the records from df2 into df1 but only if they are not present in df1.

For example, jack ryan is present in both df2 and df1, so I don't want him to be inserted into df1.

The primary key in this situation can be the email. If the email exists in df1, do not insert the record.

I've been experimenting with pd.concat for a while (setting email as the index, etc.) and can't get the result I want, which is this:

  firstname lastname              email  mobile       address bank_account
0      jack     ryan   mymail@gmail.com   12346   main street      abcd123
1      john   obrien  hismail@gmail.com   51234   high street      jhkf123
2    donald    trump  email@website.com   54856   white house      kdlk123
3   patrick   murphy     some@email.com    6548  north street          NaN
4    barack    obama    other@email.com    2135       florida          NaN

You can see in the expected output that jack ryan has not been appended to the new dataframe, as the email was checked before appending the data.

harpan · Accepted Answer

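You simply need to concatenate the two dataframes and then call drop_duplicates on the email column. Since df1 comes first and drop_duplicates keeps the first occurrence by default, the df1 record is the one retained for any email that appears in both.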

pd.concat([df1,df2], ignore_index=True).drop_duplicates('email')

Output:

  firstname lastname              email  mobile       address bank_account
0      jack     ryan   mymail@gmail.com   12346   main street      abcd123
1      john   obrien  hismail@gmail.com   51234   high street      jhkf123
2    donald    trump  email@website.com   54856   white house      kdlk123
3   patrick   murphy     some@email.com    6548  north street          NaN
4    barack    obama    other@email.com    2135       florida          NaN
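
For anyone who wants to run this end to end, here is a minimal sketch using only the four columns defined in the question (so no mobile or address). The names combined, new_rows and filtered are mine, not from the question. It shows the concat plus drop_duplicates approach with a reset_index(drop=True) added so the row labels run 0 to 4, and, as an alternative I am assuming may also suit, filtering df2 with isin before concatenating. The two differ in one respect: drop_duplicates would also remove repeated emails inside df1 itself, while the isin filter leaves df1 untouched.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'firstname': ['jack', 'john', 'donald'],
    'lastname': ['ryan', 'obrien', 'trump'],
    'email': ['mymail@gmail.com', 'hismail@gmail.com', 'email@website.com'],
    'bank_account': ['abcd123', 'jhkf123', 'kdlk123'],
})
df2 = pd.DataFrame({
    'firstname': ['jack', 'patrick', 'barak'],
    'lastname': ['ryan', 'murphy', 'obama'],
    'email': ['mymail@gmail.com', 'some@email.com', 'other@email.com'],
    'bank_account': [np.nan] * 3,
})

# Approach from the answer: stack both frames, then keep the first row
# seen for each email. df1 is listed first, so its rows take priority.
combined = (
    pd.concat([df1, df2], ignore_index=True)
      .drop_duplicates(subset='email', keep='first')
      .reset_index(drop=True)  # relabel rows 0..n-1 after the drop
)

# Alternative sketch: select only the df2 rows whose email is not
# already in df1, then append those rows.
new_rows = df2[~df2['email'].isin(df1['email'])]
filtered = pd.concat([df1, new_rows], ignore_index=True)

print(combined)
print(filtered)
```

With the sample data both results contain the same five emails, with jack ryan appearing once and keeping his df1 bank_account.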
