merge new data into old data by replacing old data while appending new rows

Question

I have 2 data frames with the same column names. Old data frame old_df and the new data frame is new_df with 1 column as a key.

I am trying to merge the 2 data frames into a single data frame which following conditions.

If the key is missing in the new table, then data from old_df should be taken
if the key is missing in old table, then data from new_df should be added.
If the key is present in both the tables then the data from new_df should overwrite the data from old_df.

Below is my code snippet that I am trying to play with.

new_data = pd.read_csv(filepath)
new_data.set_index(['Name'])
old_data = pd.read_sql_query("select * from dbo.Details", con=engine)
old_data.set_index(['Name'])
merged_result = pd.merge(new_data[['Name','RIC','Volatility','Sector']],
                         old_data,
                         on='Name',
                         how='outer')

I am thinking of using np.where from this point onwards but not sure how to proceed. please advice.

jezrael · Accepted Answer

I believe you need DataFrame.combine_first with DataFrame.set_index for match by Name columns:

merged_result = (new_data.set_index('Name')[['RIC','Volatility','Sector']]
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())

Sample data:

old_data = pd.DataFrame({'RIC':range(6),
                         'Volatility':[5,3,6,9,2,4], 
                         'Name':list('abcdef')})

print (old_data)
   RIC  Volatility Name
0    0           5    a
1    1           3    b
2    2           6    c
3    3           9    d
4    4           2    e
5    5           4    f

new_data = pd.DataFrame({'RIC':range(4),
                         'Volatility':[10,20,30, 40], 
                         'Name': list('abhi')})

print (new_data)
   RIC  Volatility Name
0    0          10    a
1    1          20    b
2    2          30    h
3    3          40    i

merged_result = (new_data.set_index('Name')
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())
print (merged_result)
  Name  RIC  Volatility
0    a  0.0        10.0
1    b  1.0        20.0
2    c  2.0         6.0
3    d  3.0         9.0
4    e  4.0         2.0
5    f  5.0         4.0
6    h  2.0        30.0
7    i  3.0        40.0

merge new data into old data by replacing old data while appending new rows

Answers (2)

Related Questions