jakes
jakes

Reputation: 2085

How to replace some rows in a dataframe with the corresponding rows from other data frame with additional columns

Let's assume I have two dataframes:

df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                   'B': [300, 400, 500, 600], 
                   'C': [6, 7, 8, 9]})
df1.set_index('A', inplace=True)
df2 = pd.DataFrame({'A': [2, 3],
                    'B': [433, 555],
                    'C': [99, 99],
                    'D': [1, 2]})
df2.set_index('A', inplace=True)

I want to replace rows in df1 with rows in df2 based on index. My first attempts were:

df1.update(df2)
df1.loc[df2.index, :] = df2

However, it only replaces columns that excisted in df1 and doesn't add D column from df2. How can I add D column as well with NaN values for other rows?

Please note that my original data frames have hundreds of columns, and original df2 have many additional columns so any manual definitions of columns to be used/created will be enormously cumbersome.

Upvotes: 2

Views: 184

Answers (4)

WGP
WGP

Reputation: 748

You can use the combine function as follows.

df_new = df1.combine(df2, func=lambda s1, s2: s2.fillna(s1))

This is very similar to the combine_first method mentioned in a previous answer but gives you a bit more freedom as you can write the function you want to apply yourself.

Upvotes: 1

kederrac
kederrac

Reputation: 17322

you can use:

df1.update(df2)

cols = df2.columns[~ df2.columns.isin(df1.columns)]
mask = df2.index.isin(df1.index)
df1[cols] = df2[cols][mask]

df1

output:

enter image description here


without pandas.DataFrame.update:

df1_mask_index = df1.index.isin(df2.index)
df2_mask_index = df2.index.isin(df1.index)

same_cols = df1.columns[df1.columns.isin(df2.columns)]
diff_cols = df2.columns[~df2.columns.isin(df1.columns)]

df1[df1_mask_index] = df2[same_cols][df2_mask_index]
df1[diff_cols] = df2[diff_cols][df2_mask_index]

df1

output:

enter image description here

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Let's use pd.DataFrame.combine_first:

df2.combine_first(df1)

Output:

       B     C    D
A                  
0  300.0   6.0  NaN
1  400.0   7.0  NaN
2  433.0  99.0  1.0
3  555.0  99.0  2.0

Upvotes: 3

jezrael
jezrael

Reputation: 862481

I think you need DataFrame.align with DataFrame.update:

df1 = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [300, 400, 500, 600], 'E': [6, 7, 8, 9]}) 
df1.set_index('A', inplace=True) 
df2 = pd.DataFrame({'A': [2, 5], 'B': [433, 555], 'C': [99, 99], 'D': [1, 2]}) 
df2.set_index('A', inplace=True)
print (df1)
     B  E
A        
0  300  6
1  400  7
2  500  8
3  600  9

print (df2)
     B   C  D
A            
2  433  99  1
5  555  99  2

df1, df2 = df1.align(df2)
print (df1)
       B   C   D    E
A                    
0  300.0 NaN NaN  6.0
1  400.0 NaN NaN  7.0
2  500.0 NaN NaN  8.0
3  600.0 NaN NaN  9.0
5    NaN NaN NaN  NaN

print (df2)
       B     C    D   E
A                      
0    NaN   NaN  NaN NaN
1    NaN   NaN  NaN NaN
2  433.0  99.0  1.0 NaN
3    NaN   NaN  NaN NaN
5  555.0  99.0  2.0 NaN

df1.update(df2)
print (df1)
       B     C    D    E
A                       
0  300.0   NaN  NaN  6.0
1  400.0   NaN  NaN  7.0
2  433.0  99.0  1.0  8.0
3  600.0   NaN  NaN  9.0
5  555.0  99.0  2.0  NaN

EDIT: Because sorting problem is possible change align to DataFrame.reindex by Index.union of columns and if necessary also by indexes:

df1 = pd.DataFrame({'E': [0, 1, 2, 3], 'A': [300, 400, 500, 600], 'C': [6, 7, 8, 9]}) 
df1.set_index('A', inplace=True) 
df2 = pd.DataFrame({'A': [2, 5], 'C': [433, 555], 'B': [99, 99], 'D': [1, 2]}) 
df2.set_index('A', inplace=True)

idx = df1.index.union(df2.index, sort=False)
cols = df1.columns.union(df2.columns, sort=False) 
df1 = df1.reindex(index=idx,columns=cols) 
df2 = df2.reindex(index=idx,columns=cols) 
print (df1)
       E    C   B   D
A                    
300  0.0  6.0 NaN NaN
400  1.0  7.0 NaN NaN
500  2.0  8.0 NaN NaN
600  3.0  9.0 NaN NaN
2    NaN  NaN NaN NaN
5    NaN  NaN NaN NaN

print (df2)
      E      C     B    D
A                        
300 NaN    NaN   NaN  NaN
400 NaN    NaN   NaN  NaN
500 NaN    NaN   NaN  NaN
600 NaN    NaN   NaN  NaN
2   NaN  433.0  99.0  1.0
5   NaN  555.0  99.0  2.0

df1.update(df2)
print (df1)
       E      C     B    D
A                         
300  0.0    6.0   NaN  NaN
400  1.0    7.0   NaN  NaN
500  2.0    8.0   NaN  NaN
600  3.0    9.0   NaN  NaN
2    NaN  433.0  99.0  1.0
5    NaN  555.0  99.0  2.0

Upvotes: 0

Related Questions