SCool
SCool

Reputation: 3375

Append / insert / concat rows to dataframe only if record is not already present

I have two dataframes of customer information:

df1 = pd.DataFrame({'firstname':['jack','john','donald'],
                  'lastname':['ryan','obrien','trump'],
                   'email':['[email protected]','[email protected]','[email protected]'],
                   'bank_account':['abcd123','jhkf123','kdlk123']})

print(df1)

  firstname lastname              email bank_account
0      jack     ryan   [email protected]      abcd123
1      john   obrien  [email protected]      jhkf123
2    donald    trump  [email protected]      kdlk123


df2 = pd.DataFrame({'firstname':['jack','patrick','barak'],
                  'lastname':['ryan','murphy','obama'],
                   'email':['[email protected]','[email protected]','[email protected]'],
                   'bank_account':[pd.np.nan]*3})

print(df2)


  firstname lastname             email  bank_account
0      jack     ryan  [email protected]           NaN
1   patrick   murphy    [email protected]           NaN
2     barak    obama   [email protected]           NaN

I want to insert the records from df2 into df1 but only if they are not present in df1.

For example we can see that jack ryan is present in df2 and df1, so i don't want him to be inserted into the df1.

The primary key in this situation can be the email. If the email exists in df1, do not insert the record.

I've been experimenting and googling with pd.concat for the last while, setting email as the index etc. and can't get the result I want, which is this:

  firstname lastname              email  mobile       address bank_account
0      jack     ryan   [email protected]   12346   main street      abcd123
1      john   obrien  [email protected]   51234   high street      jhkf123
2    donald    trump  [email protected]   54856   white house      kdlk123
3   patrick   murphy     [email protected]    6548  north street          NaN
4    barack    obama    [email protected]    2135       florida          NaN

You can see in the expected output that jack ryan has not been appended to the new dataframe, as the email was checked before appending the data.

Upvotes: 2

Views: 1591

Answers (2)

llalwani11
llalwani11

Reputation: 106

You can try something like:

df1_unique = df1['email'].unique()
df1 = df1.append(df2.loc[~df2['email'].isin(df1_unique), :])
df1

Output:

    firstname   lastname    email               bank_account
0   jack        ryan        [email protected]    abcd123
1   john        obrien      [email protected]   jhkf123
2   donald      trump       [email protected]   kdlk123
1   patrick     murphy      [email protected]      NaN
2   barak       obama       [email protected]     NaN

Upvotes: 1

harpan
harpan

Reputation: 8631

You simply need to concat and then use drop-duplicates

pd.concat([df1,df2], ignore_index=True).drop_duplicates('email')

Output:

      firstname lastname              email  mobile       address bank_account
0      jack     ryan   [email protected]   12346   main street      abcd123
1      john   obrien  [email protected]   51234   high street      jhkf123
2    donald    trump  [email protected]   54856   white house      kdlk123
3   patrick   murphy     [email protected]    6548  north street          NaN
4    barack    obama    [email protected]    2135       florida          NaN

Upvotes: 4

Related Questions