Reputation: 173
In the help page for pandas.DataFrame.update they have the following example to update df with new_df:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, 5, 6],
'C': [7, 8, 9]})
df.update(new_df)
df
A B
0 1 4
1 2 5
2 3 6
What is the best way to get 'C' to show up as a new column in df using the update command? I could not figure it out, but if I create the new column in df using either of the following:
df['C'] = np.NaN
df['C'] = 0
that seems to populate it, but I'm wondering if there is a better way or some way to do it in the df.update.
Upvotes: 3
Views: 673
Reputation: 23217
This answer is aimed at discussing the differences of DataFrame.update()
with various similar functions, such as .combine_first()
and .assign()
discussed in this post.
An example is sometimes better than thousand words.
df = pd.DataFrame({'A': [1, 2, 3],
'B': [400, 500, np.nan]})
new_df = pd.DataFrame({'B': [4, np.nan, 6],
'C': [7, 8, 9]})
df
A B
0 1 400.0
1 2 500.0
2 3 NaN
new_df
B C
0 4.0 7
1 NaN 8
2 6.0 9
.update()
:df.update(new_df)
A B
0 1 4.0
1 2 500.0 <== 500 from column B of df is NOT replaced by NaN from new_df
2 3 6.0
Note also that df
after the update (in place) still contains 2 columns only.
.combine_first()
df.combine_first(new_df)
A B C
0 1 400.0 7
1 2 500.0 8
2 3 6.0 9 <== Only NaN entry in column B of df got replaced
Note that we have 3 columns in the result here.
Next, let's try reversing the subject and object of the call:
new_df.combine_first(df)
A B C
0 1 4.0 7
1 2 500.0 8 <== Only NaN entry in column B of new_df got replaced
2 3 6.0 9
Note that we have 3 columns in the result here. Different from the result of .update()
This is among all the nearest to the result of .update()
but still with one column more. Moreover, we have to reverse the subject and object in the call: we want to update df
but have to call using new_df
to get similar result. This is not so intuitive.
.assign()
df.assign(**new_df)
A B C
0 1 4.0 7
1 2 NaN 8 <== 500 from column B of df got replaced by NaN from new_df
2 3 6.0 9
Note that we have 3 columns in the result here. Different from the result of .update()
C
to show up as a new column in df
using .update()
command?Since only values at matching index/column labels are updated, we can achieve this by using .reindex()
on the union of both row and column indexes before using .update()
command, as follows:
df = df.reindex(index=df.index.union(new_df.index),
columns=df.columns.union(new_df.columns))
df.update(new_df)
A B C
0 1 4.0 7.0
1 2 500.0 8.0
2 3 6.0 9.0
This works better than manually creating missing columns and rows in df
for there could be multiple row and column indexes to add for large datasets.
Upvotes: 1
Reputation: 133458
There is another way where we could use combine_first
and make sure right DataFrames are passed to it. If you pass it correct manner then you will get expected output.
new_df.combine_first(df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Upvotes: 4