ygesher

Reputation: 1161

Performance optimization on Django update or create

In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.

Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.

So for now I do:

for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1'], ...})
    except Exception:
        print('error!')

And it takes... forever! One or two rows a second at best, sometimes several seconds per row. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?

Thanks for any help.

Upvotes: 9

Views: 11114

Answers (4)

Yariv Katz

Reputation: 1362

The problem is that you are doing a database operation for every row you grabbed from the API. You can avoid that by working out which rows are new (and bulk-inserting all of them at once), which rows actually need an update, and which haven't changed. To elaborate:

  1. Grab all the relevant rows from the database (meaning all the rows that could possibly be updated):

    old_data = MyModel.objects.all()  # narrow this with MyModel.objects.filter(...) if possible

  2. Grab all the API data you need to insert or update:

    api_data = [...]

  3. For each row of data, decide whether it is new (and collect it for a bulk insert) or whether it needs to update the DB:

    new_rows_array = []
    for row in api_data:
        if is_new_row(row, old_data):
            new_rows_array.append(row)
        elif is_data_modified(row, old_data):
            ...
            # do the update
    MyModel.objects.bulk_create(new_rows_array)

is_new_row - decides whether the row is new, so it can be added to the array that will be bulk-created.

is_data_modified - looks the row up in the old data and checks whether that row's data has changed; the update runs only if it has.
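A minimal sketch of this diff-then-bulk idea, using plain dicts so the comparison logic is clear (the field name `field1` and the helper name `partition_rows` are illustrative, not from the original answer; in Django you would build `old_data_by_id` with something like `MyModel.objects.in_bulk()`):

```python
def partition_rows(api_data, old_data_by_id):
    """Split API rows into rows to bulk-create and rows to update.

    api_data: list of dicts like {'id': ..., 'field1': ...}
    old_data_by_id: dict mapping id -> dict of the values currently in the DB
    """
    new_rows, changed_rows = [], []
    for row in api_data:
        old = old_data_by_id.get(row['id'])
        if old is None:
            new_rows.append(row)        # not in the DB yet -> bulk_create later
        elif old['field1'] != row['field1']:
            changed_rows.append(row)    # exists but changed -> needs an update
        # unchanged rows are skipped entirely: no database operation at all
    return new_rows, changed_rows
```

After partitioning, one `bulk_create()` inserts every new row in a single query, and only the genuinely changed rows cost an `UPDATE` each.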

Upvotes: 8

saran3h

Reputation: 14022

Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and store it in your DB for each row. That way you only have to check the rows where this timestamp differs from the one in the API.
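A small sketch of that filtering step, assuming the stored timestamps have already been loaded into a dict (the key name `last_updated` and the helper name are hypothetical):

```python
def rows_needing_update(api_rows, stored_timestamps):
    """Keep only the API rows whose last-updated timestamp differs
    from the one saved in the DB on a previous run.

    api_rows: iterable of dicts with 'id' and 'last_updated' keys
    stored_timestamps: dict mapping id -> the timestamp we stored
    """
    return [
        row for row in api_rows
        if stored_timestamps.get(row['id']) != row['last_updated']
    ]
```

Rows missing from `stored_timestamps` also come back (their stored value is `None`), so brand-new rows are caught by the same check.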

I faced exactly this issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8,000-odd rows. With selective updates, I cut that down to just 10-15 seconds, depending on how many rows had actually changed.

Upvotes: 3

lasthuman

Reputation: 1

I think the code below can achieve the same thing as update_or_create, split into two operations:

MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()

Upvotes: -2

Kevin Christopher Henry

Reputation: 48952

If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.

But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:

updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])

(Note that this simplified example has a race condition, see the Django source for how to deal with it.)

In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.

Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.

Upvotes: 3
