Reputation: 1161
In a Django project, I'm refreshing tens of thousands of rows of data from an external API on a daily basis. The problem is that since I don't know whether each row is new or just an update, I can't do a simple bulk_create operation.
Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which ones, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1'], ...})
    except:
        print 'error!'
And it takes... forever! One or two rows per second at best, sometimes several seconds per row. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete everything and reinsert it every day. I can't wrap my head around this one: how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
Upvotes: 9
Views: 11114
Reputation: 1362
The problem is that you are performing a database operation for each row you grabbed from the API. You can avoid that by first working out which rows are new (and bulk-inserting all of them), which rows actually need an update, and which haven't changed. To elaborate:
old_data = MyModel.objects.all() # if possible, narrow this down with MyModel.objects.filter(...)
api_data = [...]
new_rows_array = []
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    elif is_data_modified(row, old_data):
        ...
        # do the update
MyModel.objects.bulk_create(new_rows_array)
is_new_row
- checks whether the row is new; if so, the row is added to an array that will be bulk-created at the end
is_data_modified
- looks the row up in the old data and checks whether that row's data has changed; the update runs only if it has changed
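A minimal sketch of those two helpers, assuming each API row carries a unique id key and that field1/value1 (taken from the question) are the only tracked data. Indexing old_data by id once keeps each check O(1) instead of rescanning the queryset, so here the helpers take the indexed dict rather than the raw queryset:
old_by_id = {obj.id: obj for obj in old_data}  # index the snapshot by primary key once

def is_new_row(row, old_by_id):
    # a row is new if its id is absent from the current database snapshot
    return row['id'] not in old_by_id

def is_data_modified(row, old_by_id):
    # a row needs an update only if a tracked field actually differs
    return old_by_id[row['id']].field1 != row['value1']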
Upvotes: 8
Reputation: 14022
Just to add to the accepted answer: one way of recognizing whether an operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and store it in your DB for each row. That way you only have to process the rows whose timestamp differs from the one in the API.
I faced exactly this issue: I was updating every existing row and creating new ones, and it took a whole minute to update 8,000-odd rows. With selective updates, I cut that down to just 10-15 seconds, depending on how many rows had actually changed.
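A rough sketch of that idea, assuming a hypothetical last_updated field on the model and a matching last_updated value in each API row (both names are illustrative, not from the question):
stored = dict(MyModel.objects.values_list('id', 'last_updated'))  # id -> timestamp saved at last refresh

changed_rows = []
for row in api_data:
    if stored.get(row['id']) == row['last_updated']:
        continue  # timestamp unchanged, so skip this row entirely
    changed_rows.append(row)  # new row, or an existing row whose timestamp moved
Only changed_rows then needs the create/update handling from the accepted answer.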
Upvotes: 3
Reputation: 1
I think the two calls below can do the same thing together, instead of update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()
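Combined per row, that suggestion might look like the sketch below (id and field1 are taken from the question; this is essentially the same update-then-create pattern as the answer that follows):
updated = MyModel.objects.filter(id=row['id']).update(field1=row['value1'])  # update() returns the number of rows touched
if not updated:
    # no existing row matched, so fall back to creating one
    MyModel.objects.get_or_create(id=row['id'], defaults={'field1': row['value1']})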
Upvotes: -2
Reputation: 48952
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency; for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition; see the Django source for how to deal with it.)
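For illustration, here is one way to narrow that race, loosely modeled on how update_or_create itself handles concurrent inserts. This sketch assumes field1 carries a unique constraint, so a concurrent insert makes create() raise IntegrityError:
from django.db import IntegrityError, transaction

updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    try:
        with transaction.atomic():
            # if another process inserted the row after update() ran, the
            # unique constraint on field1 makes this create() raise IntegrityError
            MyModel.objects.create(field1="stuff", field2="other")
    except IntegrityError:
        # lost the race: the row exists now, so update it instead
        MyModel.objects.filter(field1="stuff").update(field2="other")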
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Upvotes: 3