Python Pandas: Update a data frame with values from another, without replacing existing values

Question

I'm having trouble updating a data frame column that already contains some values.

Here is an example:

import pandas as pd

df = pd.DataFrame({
    'email': ['1@dummy.com', '2@dummy.com', '3@dummy.com', '4@dummy.com'],
    'Name': ['John', 'Sam', None, None],
    'id': ['A0', 'A1', 'A2', 'A3'],
})
df

       Name    email        id
    0  John    1@dummy.com  A0
    1  Sam     2@dummy.com  A1
    2  None    3@dummy.com  A2
    3  None    4@dummy.com  A3

ref_df = pd.DataFrame({
    'email': ['1@dummy.com', '2@dummy.com', '3@dummy.com', '4@dummy.com'],
    'Name': ['', 'Sam', 'Tim', 'Sara'],
    'random': ['f', 's', 'r', 'a'],
})
ref_df
   Name    email        random
0          1@dummy.com  f
1  Sam     2@dummy.com  s
2  Tim     3@dummy.com  r
3  Sara    4@dummy.com  a

The result I want is below:

   Name    email        id
0  John    1@dummy.com  A0
1  Sam     2@dummy.com  A1
2  Tim     3@dummy.com  A2
3  Sara    4@dummy.com  A3

I want to populate Name with values from ref_df, matched on email, while keeping the existing values: only null values in Name should be updated. I also want to keep only the original columns of df (that is, drop the random column that comes from ref_df).

I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.

Below is what I have tried. It works when I run the code line by line, but once I wrap it in a function, I get a KeyError.

I'm sure there is a better way for doing this. Any help is appreciated!

def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
              right_on=None,):
    df = init_columns(df, cols=ref_cols)
    cols_to_keep = list(df.columns)
    gap_cols = df.columns.difference(ref_df.columns)
    gap_df = merge(
        df[gap_cols],
        ref_df,
        how,
        left_on,
        right_on,
    )
    gap_df = gap_df[cols_to_keep].set_index(index)
    df = df.set_index(index)
    df.update(gap_df)
    df=df[cols_to_keep]
    return df

jpp · Accepted Answer

This should work:

df['Name'] = df['Name'].fillna(df['email'].map(ref_df.set_index('email')['Name']))

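This builds an email-to-Name mapping from ref_df, then uses it to fill only the null entries of df['Name']; names that are already present are left untouched. Note that it assumes each email appears only once in ref_df, since the mapping needs a unique index.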
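To make the one-liner easier to reuse with several reference frames, here is a minimal sketch that wraps it in a function. The function name `fill_missing`, its `key` and `col` parameters, and the second source `other_df` are hypothetical and not part of the original question. Dropping duplicate emails before building the mapping is an added assumption about how repeated keys should be handled (the first row wins).

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['1@dummy.com', '2@dummy.com', '3@dummy.com', '4@dummy.com'],
    'Name': ['John', 'Sam', None, None],
    'id': ['A0', 'A1', 'A2', 'A3'],
})

ref_df = pd.DataFrame({
    'email': ['1@dummy.com', '2@dummy.com', '3@dummy.com', '4@dummy.com'],
    'Name': ['', 'Sam', 'Tim', 'Sara'],
    'random': ['f', 's', 'r', 'a'],
})

# Hypothetical second source, only here to illustrate repeated updates.
other_df = pd.DataFrame({
    'email': ['3@dummy.com'],
    'Name': ['Timothy'],
})


def fill_missing(df, ref_df, key='email', col='Name'):
    """Return a copy of df with nulls in `col` filled from ref_df, matched on `key`."""
    # The mapping needs a unique index, so keep the first row per key (assumption).
    lookup = ref_df.drop_duplicates(subset=key).set_index(key)[col]
    out = df.copy()
    # fillna only touches null entries, so existing names are preserved.
    out[col] = out[col].fillna(out[key].map(lookup))
    return out


# Apply each source in turn; earlier sources take priority over later ones.
result = df
for source in [ref_df, other_df]:
    result = fill_missing(result, source)

print(result)
```

Because only the `Name` column is assigned, extra columns such as `random` never enter the result, and the original `df` is left unmodified. Once a name has been filled from one source, later sources do not overwrite it, so 'Tim' from ref_df is kept rather than 'Timothy' from other_df.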
