zephyr707
zephyr707

Reputation: 173

How to update a dataframe from a new dataframe that has an additional column?

In the help page for pandas.DataFrame.update they have the following example to update df with new_df:

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, 5, 6],
                       'C': [7, 8, 9]})
df.update(new_df)
df
   A  B
0  1  4
1  2  5
2  3  6

What is the best way to get 'C' to show up as a new column in df using the update command? I could not figure it out, but if I create the new column in df using either of the following:

df['C'] = np.NaN
df['C'] = 0

that seems to populate it, but I'm wondering if there is a better way or some way to do it in the df.update.

Upvotes: 3

Views: 673

Answers (3)

SeaBean
SeaBean

Reputation: 23217

This answer is aimed at discussing the differences of DataFrame.update() with various similar functions, such as .combine_first() and .assign() discussed in this post.

An example is sometimes better than thousand words.

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [400, 500, np.nan]})
new_df = pd.DataFrame({'B': [4, np.nan, 6],
                       'C': [7, 8, 9]})

df
   A      B
0  1  400.0
1  2  500.0
2  3    NaN

new_df
     B  C
0  4.0  7
1  NaN  8
2  6.0  9

Using .update():

df.update(new_df)
   A      B
0  1    4.0
1  2  500.0              <==  500 from column B of df is NOT replaced by NaN from new_df
2  3    6.0

Note also that df after the update (in place) still contains 2 columns only.

Using .combine_first()

df.combine_first(new_df)
   A      B  C
0  1  400.0  7
1  2  500.0  8
2  3    6.0  9           <==  Only NaN entry in column B of df got replaced

Note that we have 3 columns in the result here.

Next, let's try reversing the subject and object of the call:

new_df.combine_first(df)
   A      B  C
0  1    4.0  7
1  2  500.0  8           <==  Only NaN entry in column B of new_df got replaced
2  3    6.0  9

Note that we have 3 columns in the result here. Different from the result of .update()

This is among all the nearest to the result of .update() but still with one column more. Moreover, we have to reverse the subject and object in the call: we want to update df but have to call using new_df to get similar result. This is not so intuitive.

Using .assign()

df.assign(**new_df)
   A    B  C
0  1  4.0  7
1  2  NaN  8             <==  500 from column B of df got replaced by NaN from new_df
2  3  6.0  9

Note that we have 3 columns in the result here. Different from the result of .update()

How to get column C to show up as a new column in df using .update() command?

Since only values at matching index/column labels are updated, we can achieve this by using .reindex() on the union of both row and column indexes before using .update() command, as follows:

df = df.reindex(index=df.index.union(new_df.index), 
                columns=df.columns.union(new_df.columns))
df.update(new_df)
   A      B    C
0  1    4.0  7.0
1  2  500.0  8.0
2  3    6.0  9.0

This works better than manually creating missing columns and rows in df for there could be multiple row and column indexes to add for large datasets.

Upvotes: 1

RavinderSingh13
RavinderSingh13

Reputation: 133458

There is another way where we could use combine_first and make sure right DataFrames are passed to it. If you pass it correct manner then you will get expected output.

new_df.combine_first(df)
    A   B   C
0   1   4   7
1   2   5   8
2   3   6   9

Upvotes: 4

Ch3steR
Ch3steR

Reputation: 20669

df.update Aligns on indices. You can use df.assign to overwrite the previous value and adding new columns too.

df.assign(**new_df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

Upvotes: 4

Related Questions