Hadron Tungsten
Hadron Tungsten

Reputation: 157

Why is pandas throwing "Data overlaps." when I use pd.update?

I have my main dataframe, df_main, which is shaped like this:

df_main = pd.DataFrame({
    'GroupID': ([1]*7) + ([2]*2) + ([3]*6),
    'GroupName': (['String 1']*7) + (['String 2']*2) + (['String 3']*6),
    'FirstName': (['Matthew']*7) + (['Mark']*2) + (['Luke']*6),
    'LastName': (['Smith']*7) + (['Jones']*2) + (['Roberts']*6),
    'StartDate': (['2020-01-01']*7) + (['1998-01-01']*2) + (['N/A']*6),
}).replace('N/A',np.NaN)

df_main

I have a secondary dataframe, df_update, that I want to use to update the main dataframe. This dataframe has the missing StartDate for Luke Roberts:

df_update = pd.DataFrame({
    'GroupID': [1, 2, 3],
    'GroupName': ['String 1', 'String 2', 'String 3'],
    'FirstName': ['Matthew', 'Mark', 'Luke'],
    'LastName': ['Smith', 'Jones', 'Roberts'],
    'StartDate': ['2020-01-01', '1998-01-01', '2005-01-01'],
})

df_update

I've set identical indices for both dataframes:

df_main = df_main.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df_update = df_update.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])

df_main with MultiIndex df_update with MultiIndex

I try to update df_main:

df_main.update(df_update, overwrite=False, errors='raise')

But it doesn't work:

ValueError: Data overlaps.

Why is that, and how can I complete this update elegantly?

EDIT: This is Python 3.6.7 and Pandas 0.25.0.

Upvotes: 3

Views: 429

Answers (1)

jezrael
jezrael

Reputation: 862691

For me working your solution in pandas 1.1.3:

df_main.update(df_update)
print (df_main)
                                       StartDate
GroupID GroupName FirstName LastName            
1       String 1  Matthew   Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
2       String 2  Mark      Jones     1998-01-01
                            Jones     1998-01-01
3       String 3  Luke      Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01

If need replace only missing values use DataFrame.fillna:

df_main = df_main.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df_update = df_update.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df = df_main.fillna(df_update)
print (df)
                                       StartDate
GroupID GroupName FirstName LastName            
1       String 1  Matthew   Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
2       String 2  Mark      Jones     1998-01-01
                            Jones     1998-01-01
3       String 3  Luke      Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01
                            Roberts   2005-01-01

If change data, is possible see difference of solutions:

df_update = pd.DataFrame({
    'GroupID': [1, 2, 3],
    'GroupName': ['String 1', 'String 2', 'String 3'],
    'FirstName': ['Matthew', 'Mark', 'Luke'],
    'LastName': ['Smith', 'Jones', 'Roberts'],
    'StartDate': ['1990-01-01', '1991-01-01', '1992-01-01'],
})

df_main = df_main.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df_update = df_update.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df_main.update(df_update)
print (df_main)
                                       StartDate
GroupID GroupName FirstName LastName            
1       String 1  Matthew   Smith     1990-01-01
                            Smith     1990-01-01
                            Smith     1990-01-01
                            Smith     1990-01-01
                            Smith     1990-01-01
                            Smith     1990-01-01
                            Smith     1990-01-01
2       String 2  Mark      Jones     1991-01-01
                            Jones     1991-01-01
3       String 3  Luke      Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01

df_main = df_main.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df_update = df_update.set_index(['GroupID', 'GroupName', 'FirstName', 'LastName'])
df = df_main.fillna(df_update)
print (df)
                                       StartDate
GroupID GroupName FirstName LastName            
1       String 1  Matthew   Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
                            Smith     2020-01-01
2       String 2  Mark      Jones     1998-01-01
                            Jones     1998-01-01
3       String 3  Luke      Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01
                            Roberts   1992-01-01

Upvotes: 1

Related Questions