qshng
qshng

Reputation: 887

Python Pandas-Update a data frame with values from another, without replacing existing

I'm having an issue with updating a data frame when I already have values in the column.

Here is an example

import pandas as pd
df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['John', 'Sam',None,None],
                  'id': ['A0', 'A1','A2', 'A3'], }
                     )
df

    df
        Name    email   id
    0   John    [email protected] A0
    1   Sam     [email protected] A1
    2   None    [email protected] A2
    3   None    [email protected] A3

ref_df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['', 'Sam','Tim','Sara'],
                  'random': ['f', 's','r', 'a'], }
                     )
ref_df
Name           email      random
0           [email protected]     f
1   Sam     [email protected]     s
2   Tim     [email protected]     r
3   Sara    [email protected]     a

The result I want is below:

Name           email    id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

I want to populate the Name with values in ref_df based on email, but keep the existing values. Only update null values in name. Also only keep the original columns in df.(get rid of the random columns in ref_df)

I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.

below is what I have tried, this works if I run the code line by line, but once I wrap it in a function, I got a keyerror.

I'm sure there is a better way for doing this. Any help is appreciated!

def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
              right_on=None,):
    df = init_columns(df, cols=ref_cols)
    cols_to_keep = list(df.columns)
    gap_cols = df.columns.difference(ref_df.columns)
    gap_df = merge(
        df[gap_cols],
        ref_df,
        how,
        left_on,
        right_on,
    )
    gap_df = gap_df[cols_to_keep].set_index(index)
    df = df.set_index(index)
    df.update(gap_df)
    df=df[cols_to_keep]
    return df

Upvotes: 0

Views: 267

Answers (2)

jpp
jpp

Reputation: 164623

This should work:

df['Name'] = df['Name'].fillna(df['email'].map(ref_df.set_index('email')['Name']))

The way this works is to create an email to Name mapping from ref_df, then use it to fill blanks in your dataframe.

Upvotes: 0

Yilun Zhang
Yilun Zhang

Reputation: 9018

What I did is to convert your ref_df as a dictionary so that we can apply mapping.

ref_dict = dict(zip(ref_df["email"], ref_df["Name"]))
ref_dict

This will give you:

{'[email protected]': 'John',
 '[email protected]': 'Sam',
 '[email protected]': 'Tim',
 '[email protected]': 'Sara'}

Then, you can:

df["Name"] = df["email"].map(ref_dict)

And you will have:

    Name          email id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

This will recreate the Name column and if you are afraid that this might change some of the existing values, you can only fill those that are NAs.

Upvotes: 1

Related Questions