Reputation: 3375
I have two dataframes of customer information:
df1 = pd.DataFrame({'firstname':['jack','john','donald'],
'lastname':['ryan','obrien','trump'],
'email':['[email protected]','[email protected]','[email protected]'],
'bank_account':['abcd123','jhkf123','kdlk123']})
print(df1)
firstname lastname email bank_account
0 jack ryan [email protected] abcd123
1 john obrien [email protected] jhkf123
2 donald trump [email protected] kdlk123
df2 = pd.DataFrame({'firstname':['jack','patrick','barak'],
'lastname':['ryan','murphy','obama'],
'email':['[email protected]','[email protected]','[email protected]'],
'bank_account':[pd.np.nan]*3})
print(df2)
firstname lastname email bank_account
0 jack ryan [email protected] NaN
1 patrick murphy [email protected] NaN
2 barak obama [email protected] NaN
I want to insert the records from df2
into df1
but only if they are not present in df1
.
For example we can see that jack ryan
is present in df2
and df1
, so i don't want him to be inserted into the df1
.
The primary key
in this situation can be the email. If the email exists in df1
, do not insert the record.
I've been experimenting and googling with pd.concat
for the last while, setting email
as the index etc. and can't get the result I want, which is this:
firstname lastname email mobile address bank_account
0 jack ryan [email protected] 12346 main street abcd123
1 john obrien [email protected] 51234 high street jhkf123
2 donald trump [email protected] 54856 white house kdlk123
3 patrick murphy [email protected] 6548 north street NaN
4 barack obama [email protected] 2135 florida NaN
You can see in the expected output that jack ryan
has not been appended to the new dataframe, as the email
was checked before appending the data.
Upvotes: 2
Views: 1591
Reputation: 106
You can try something like:
df1_unique = df1['email'].unique()
df1 = df1.append(df2.loc[~df2['email'].isin(df1_unique), :])
df1
Output:
firstname lastname email bank_account
0 jack ryan [email protected] abcd123
1 john obrien [email protected] jhkf123
2 donald trump [email protected] kdlk123
1 patrick murphy [email protected] NaN
2 barak obama [email protected] NaN
Upvotes: 1
Reputation: 8631
You simply need to concat
and then use drop-duplicates
pd.concat([df1,df2], ignore_index=True).drop_duplicates('email')
Output:
firstname lastname email mobile address bank_account
0 jack ryan [email protected] 12346 main street abcd123
1 john obrien [email protected] 51234 high street jhkf123
2 donald trump [email protected] 54856 white house kdlk123
3 patrick murphy [email protected] 6548 north street NaN
4 barack obama [email protected] 2135 florida NaN
Upvotes: 4