Reputation: 4419
I have two pandas dataframes, and one holds updated values for a subset of the rows in the primary dataframe. The primary frame is ~2m rows and the update frame is ~20k rows. The loop below is running extremely slowly, which makes sense since it's O(m*n) as far as I can tell. Is there a good way to vectorize it, or otherwise speed it up? I don't see many other optimizations that apply here. I also tried making the 'object_id' column the index, but that didn't lead to a meaningful improvement.
# df_primary this is 2m rows
# df_updated this is 20k rows
for idx, row in df_updated.iterrows():
    df_primary.loc[df_primary.object_id == row.object_id, ['status', 'category']] = [row.status, row.category]
Upvotes: 1
Views: 117
Reputation: 71689
Let's try DataFrame.update to update df_primary in place using values from df_updated:
df_primary = df_primary.set_index('object_id')
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
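As a quick sanity check, here is the same pattern on toy data (the values are made up for illustration; only the column names come from the question). update() aligns on the index, so only rows whose object_id appears in df_updated are overwritten:

```python
import pandas as pd

# Toy stand-ins for the real frames; column names match the question.
df_primary = pd.DataFrame({
    'object_id': [1, 2, 3, 4],
    'status': ['old'] * 4,
    'category': ['a'] * 4,
})
df_updated = pd.DataFrame({
    'object_id': [2, 4],
    'status': ['new', 'new'],
    'category': ['b', 'b'],
})

df_primary = df_primary.set_index('object_id')
# update() matches rows by index and overwrites the intersecting cells in place.
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
print(df_primary)
```

Rows 2 and 4 end up with the new status/category while rows 1 and 3 keep their original values, which is exactly what the iterrows loop was doing row by row.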
Upvotes: 1
Reputation: 1
Use a join (pandas merge) instead, choosing left/right/inner based on your requirements. It will be much faster than row-by-row iteration.
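A minimal sketch of the merge-based approach, assuming the column names from the question (the '_new' suffix and combine_first fallback are one possible way to prefer updated values where they exist):

```python
import pandas as pd

# Illustrative data only; column names follow the question.
df_primary = pd.DataFrame({
    'object_id': [1, 2, 3],
    'status': ['old'] * 3,
    'category': ['a'] * 3,
})
df_updated = pd.DataFrame({
    'object_id': [2],
    'status': ['new'],
    'category': ['b'],
})

# Left-join the updates onto the primary frame, keeping every primary row.
merged = df_primary.merge(df_updated, on='object_id', how='left',
                          suffixes=('', '_new'))
# Prefer the updated value where one exists; fall back to the original.
for col in ['status', 'category']:
    merged[col] = merged[col + '_new'].combine_first(merged[col])
df_primary = merged.drop(columns=['status_new', 'category_new'])
print(df_primary)
```

This assumes object_id is unique in df_updated; duplicated keys would fan out the join and produce extra rows.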
Upvotes: 0