Reputation: 64709
How do you efficiently iterate over a large queryset (records in the millions) with Django?
I'm trying to delete several million records, which I can't do with a single bulk SQL DELETE statement because the transaction would consume too much server memory. So I'm writing a Python script that groups ~10000 individual DELETE statements per transaction.
My script looks like:
from django.db import transaction
from django.conf import settings

settings.DEBUG = False

qs = MyModel.objects.filter(some_criteria=123)
total = qs.count()
i = 0
transaction.set_autocommit(False)
for record in qs.iterator():
    i += 1
    # Print progress on the first record, every 100th, and the last.
    if i == 1 or not i % 100 or i == total:
        print('%i of %i %.02f%%: %s' % (i, total, i / float(total) * 100, record))
    record.delete()
    if not i % 1000:
        transaction.commit()
transaction.commit()
This runs fine for the first 2000 records, but then errors with:
Traceback (most recent call last):
File "/project/.env/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1512, in cursor_iter
for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/project/.env/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1512, in <lambda>
for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 96, in inner
return func(*args, **kwargs)
File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 96, in inner
return func(*args, **kwargs)
django.db.utils.ProgrammingError: named cursor isn't valid anymore
How do I fix this error? I'm using PostgreSQL as my db backend.
Googling turns up very few reports of this error, and no solutions specific to Django. I've read that adding hold=True to the cursor call may fix it, but it's unclear how to set that through Django's ORM.
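For what it's worth, psycopg2 itself exposes a withhold flag on named (server-side) cursors, which sounds like the hold=True being referred to. Something like the sketch below seems possible by dropping beneath the ORM (the cursor name, the raw SQL, and the table name myapp_mymodel are just illustrative, not my actual schema), but I don't see how to make qs.iterator() use it:

from django.db import connection

connection.ensure_connection()
raw_conn = connection.connection  # the underlying psycopg2 connection

# withhold=True asks PostgreSQL for a WITH HOLD cursor, which survives
# intervening commits (though it is still destroyed by a rollback).
cursor = raw_conn.cursor(name='held_cursor', withhold=True)
cursor.execute('SELECT id FROM myapp_mymodel WHERE some_criteria = %s', [123])
try:
    while True:
        rows = cursor.fetchmany(2000)
        if not rows:
            break
        # ... issue batched DELETEs for these ids here ...
finally:
    cursor.close()  # named cursors must be closed explicitly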
I've tried adding a try/except to catch the error and continue the query, but that hasn't worked. And since the traceback doesn't include any of my own code, I'm not sure which of my lines is triggering the exception.
Upvotes: 0
Views: 1361
Reputation: 2088
This error occurs after 2000 records because that is the default chunk_size for iterator(). After those rows are deleted, the iterator doesn't know where to continue from.
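For context, the default is visible in the iterator() signature, and as far as I can tell raising it only moves the failure point further along rather than fixing anything:

# chunk_size=2000 is the default; a larger value merely postpones the
# error until the first chunk is exhausted and the server-side cursor
# (gone by then) is asked for more rows.
for record in qs.iterator(chunk_size=5000):
    ...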
I would use pagination for this task.
from django.core.paginator import Paginator
from django.db.models import Subquery

# Paginator needs a deterministic ordering to page consistently.
qs = MyModel.objects.filter(some_criteria=123).order_by('pk')
paginator = Paginator(qs, 10000)
for page_num in reversed(paginator.page_range):
    MyModel.objects.filter(
        pk__in=Subquery(paginator.page(page_num).object_list.values('pk'))
    ).delete()
Here we go over the pages in reverse order to avoid issues similar to the one you get with iterator(). We can't call delete() directly on the object_list because that's not allowed on sliced querysets, so we get the pks of the objects on each page and filter against them before deleting. The Subquery saves us from one additional query per iteration.
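If you'd rather skip Paginator altogether, the same batching idea can be written as a plain loop over primary keys (a rough sketch; BATCH_SIZE and the per-batch transaction are my own choices, not something from the question):

from django.db import transaction

BATCH_SIZE = 10000
qs = MyModel.objects.filter(some_criteria=123)

while True:
    # Grab one batch of primary keys, then delete exactly those rows.
    pks = list(qs.values_list('pk', flat=True)[:BATCH_SIZE])
    if not pks:
        break
    with transaction.atomic():
        MyModel.objects.filter(pk__in=pks).delete()

This costs two queries per batch instead of one, but it avoids server-side cursors entirely, so nothing can be invalidated between commits.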
Upvotes: 2