Reputation: 887
I'm having an issue with updating a data frame when I already have values in the column.
Here is an example
import pandas as pd
df = pd.DataFrame({
'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'Name': ['John', 'Sam',None,None],
'id': ['A0', 'A1','A2', 'A3'], }
)
df
df
Name email id
0 John [email protected] A0
1 Sam [email protected] A1
2 None [email protected] A2
3 None [email protected] A3
ref_df = pd.DataFrame({
'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'Name': ['', 'Sam','Tim','Sara'],
'random': ['f', 's','r', 'a'], }
)
ref_df
Name email random
0 [email protected] f
1 Sam [email protected] s
2 Tim [email protected] r
3 Sara [email protected] a
The result I want is below:
Name email id
0 John [email protected] A0
1 Sam [email protected] A1
2 Tim [email protected] A2
3 Sara [email protected] A3
I want to populate the Name
with values in ref_df based on email
, but keep the existing values. Only update null values in name
. Also only keep the original columns in df.(get rid of the random columns in ref_df)
I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.
below is what I have tried, this works if I run the code line by line, but once I wrap it in a function, I got a keyerror.
I'm sure there is a better way for doing this. Any help is appreciated!
def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
right_on=None,):
df = init_columns(df, cols=ref_cols)
cols_to_keep = list(df.columns)
gap_cols = df.columns.difference(ref_df.columns)
gap_df = merge(
df[gap_cols],
ref_df,
how,
left_on,
right_on,
)
gap_df = gap_df[cols_to_keep].set_index(index)
df = df.set_index(index)
df.update(gap_df)
df=df[cols_to_keep]
return df
Upvotes: 0
Views: 267
Reputation: 164623
This should work:
df['Name'] = df['Name'].fillna(df['email'].map(ref_df.set_index('email')['Name']))
The way this works is to create an email
to Name
mapping from ref_df
, then use it to fill blanks in your dataframe.
Upvotes: 0
Reputation: 9018
What I did is to convert your ref_df
as a dictionary so that we can apply mapping.
ref_dict = dict(zip(ref_df["email"], ref_df["Name"]))
ref_dict
This will give you:
{'[email protected]': 'John',
'[email protected]': 'Sam',
'[email protected]': 'Tim',
'[email protected]': 'Sara'}
Then, you can:
df["Name"] = df["email"].map(ref_dict)
And you will have:
Name email id
0 John [email protected] A0
1 Sam [email protected] A1
2 Tim [email protected] A2
3 Sara [email protected] A3
This will recreate the Name
column and if you are afraid that this might change some of the existing values, you can only fill those that are NA
s.
Upvotes: 1