Reputation: 4419
I have two pandas dataframes, and one holds updated values for a subset of the rows in the primary dataframe. The primary frame is ~2m rows and the update frame is ~20k rows. The loop below is running extremely slowly, which makes sense since it's O(m*n) as far as I can tell. Is there a good way to vectorize it, or otherwise speed it up? I don't see many other optimizations that apply here. I also tried making the 'object_id' column the index, but that didn't lead to a meaningful improvement.
# df_primary this is 2m rows
# df_updated this is 20k rows
for idx, row in df_updated.iterrows():
    df_primary.loc[df_primary.object_id == row.object_id, ['status', 'category']] = [row.status, row.category]
Upvotes: 1
Views: 117
Reputation: 71689
Let's try DataFrame.update to update df_primary in place using values from df_updated:
df_primary = df_primary.set_index('object_id')
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
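As a quick sanity check, here is the same pattern on toy data (the values are made up for illustration; only the column names come from the question). update() aligns on the index, so only rows whose object_id appears in df_updated are overwritten:

```python
import pandas as pd

# Toy stand-ins for the real frames; column names match the question.
df_primary = pd.DataFrame({
    'object_id': [1, 2, 3, 4],
    'status': ['old'] * 4,
    'category': ['a'] * 4,
})
df_updated = pd.DataFrame({
    'object_id': [2, 4],
    'status': ['new', 'new'],
    'category': ['b', 'b'],
})

df_primary = df_primary.set_index('object_id')
# update() matches rows by index and overwrites the intersecting cells in place.
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
print(df_primary)
```

Rows 2 and 4 end up with the new status/category while rows 1 and 3 keep their original values, which is exactly what the iterrows loop was doing row by row.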
Upvotes: 1
Reputation: 1
Use a join (pandas merge) instead, choosing left/right/inner based on your requirements. It will be much faster than row-by-row iteration.
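A minimal sketch of the merge-based approach, assuming the column names from the question (the '_new' suffix and combine_first fallback are one possible way to prefer updated values where they exist):

```python
import pandas as pd

# Illustrative data only; column names follow the question.
df_primary = pd.DataFrame({
    'object_id': [1, 2, 3],
    'status': ['old'] * 3,
    'category': ['a'] * 3,
})
df_updated = pd.DataFrame({
    'object_id': [2],
    'status': ['new'],
    'category': ['b'],
})

# Left-join the updates onto the primary frame, keeping every primary row.
merged = df_primary.merge(df_updated, on='object_id', how='left',
                          suffixes=('', '_new'))
# Prefer the updated value where one exists; fall back to the original.
for col in ['status', 'category']:
    merged[col] = merged[col + '_new'].combine_first(merged[col])
df_primary = merged.drop(columns=['status_new', 'category_new'])
print(df_primary)
```

This assumes object_id is unique in df_updated; duplicated keys would fan out the join and produce extra rows.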
Upvotes: 0