Update date-frame with another data-frame which as a different number of columns

Question

I have a large df called data which looks like:

     Identifier     Surname  First names(s)  Date change  Work Pattern  Region 
0       12233.0       Smith             Bob                        FT      NW
1       54213.0       Jones           Sally     15/04/15           FT      NW
2       12237.0       Evans           Steve     26/08/14           FT      SE
3       10610.0      Cooper             Amy     16/08/12           FT      SE

I have another dataframe called updates. In this example the dataframe has updated information for data for a couple of records and looks like:

     Identifier     Surname  First names(s)   Date change 
0       12233.0       Smith             Bob      05/09/14                  
1       10610.0      Cooper             Amy      16/08/12

I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:

     Identifier     Surname  First names(s)  Date change  Work Pattern  Region 
0       12233.0       Smith             Bob     15/09/14           FT      NW
1       54213.0       Jones           Sally     15/04/15           FT      NW
2       12237.0       Evans           Steve     26/08/14           FT      SE
3       10610.0      Cooper             Amy     16/08/12           FT      SE

As you can see the Date change field for Bob in the data df has been updated with the Date change from the updates df.

What can I try next?

chitown88 · Accepted Answer

a while back, I was dealing with that too. the straight up .update was giving me issues (sorry can't remember the exact issue I had. I think it was that when you do .update, it's reliant on indexes matching, and they didn't match in my 2 separate dataframes. so I wanted to use certain columns as my index to update on),

But I made a function to deal with it. So this might be way overkill than what's needed but try this and see if it'll work.

I'm also assuming the date you want update from the updates dataframe should be 15/09/14 not 05/09/14. So I had that different in my sample data below

Also, I'm assuming the Identifier is unique key. If not, you'll need to include multiple columns as your unique key

import sys
import pandas as pd


data = pd.DataFrame([[12233.0,'Smith','Bob','','FT','NW'],
                    [54213.0,'Jones','Sally','15/04/15','FT','NW'],
                    [12237.0,'Evans','Steve','26/08/14','FT','SE'],
                    [10610.0,'Cooper','Amy','16/08/12','FT','SE']], 
                    columns = ['Identifier','Surname','First names(s)','Date change','Work Pattern','Region'])



updates = pd.DataFrame([[12233.0,'Smith','Bob','15/09/14'],
                    [10610.0,'Cooper','Amy','16/08/12']],
                    columns = ['Identifier','Surname','First names(s)','Date change'])




def update(df1, df2, keys_list):
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)

    dup_idx1 = df1.index.get_duplicates()
    dup_idx2 = df2.index.get_duplicates()

    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('
'+'#'*50+'
Error! Duplicate Indicies:')
        for element in dup_idx1:
            print('df1: %s' %(element,))
        for element in dup_idx2:
            print('df2: %s' %(element,))
        print('#'*50+'

')

    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1


# the 3rd input is a list, in case you need multiple columns as your unique key
df = update(data, updates, ['Identifier'])

Output:

print (data)
   Identifier Surname First names(s) Date change Work Pattern Region
0     12233.0   Smith            Bob                       FT     NW
1     54213.0   Jones          Sally    15/04/15           FT     NW
2     12237.0   Evans          Steve    26/08/14           FT     SE
3     10610.0  Cooper            Amy    16/08/12           FT     SE


print (updates)
   Identifier Surname First names(s) Date change
0     12233.0   Smith            Bob    15/09/14
1     10610.0  Cooper            Amy    16/08/12

df = update(data, updates, ['Identifier'])

In [19]: print (df)
   Identifier Surname First names(s) Date change Work Pattern Region
0     12233.0   Smith            Bob    15/09/14           FT     NW
1     54213.0   Jones          Sally    15/04/15           FT     NW
2     12237.0   Evans          Steve    26/08/14           FT     SE
3     10610.0  Cooper            Amy    16/08/12           FT     SE

Update date-frame with another data-frame which as a different number of columns

Answers (2)

Related Questions