Cerin

Reputation: 64709

Iterating over large querysets in Django

How do you efficiently iterate over a large queryset (records in the millions) with Django?

I'm trying to delete several million records, which I can't do with a single bulk SQL DELETE statement because the transaction would consume too much server memory. So I'm writing a Python script that groups ~10000 individual DELETE statements into each transaction.

My script looks like:

from django.db import transaction
from django.conf import settings

settings.DEBUG = False  # keep Django from accumulating a log of every query

qs = MyModel.objects.filter(some_criteria=123)
total = qs.count()
i = 0
transaction.set_autocommit(False)
for record in qs.iterator():
    i += 1
    # Report progress on the first record, every 100th record, and the last one.
    if i == 1 or not i % 100 or i == total:
        print('%i of %i %.02f%%: %s' % (i, total, i / float(total) * 100, record))
    record.delete()
    # Commit every 1000 deletes to keep each transaction small.
    if not i % 1000:
        transaction.commit()
transaction.commit()

This runs fine for the first 2000 records, but then errors with:

Traceback (most recent call last):
  File "/project/.env/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1512, in cursor_iter
    for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
  File "/project/.env/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1512, in <lambda>
    for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
  File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 96, in inner
    return func(*args, **kwargs)
  File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/project/.env/lib/python3.7/site-packages/django/db/utils.py", line 96, in inner
    return func(*args, **kwargs)
django.db.utils.ProgrammingError: named cursor isn't valid anymore

How do I fix this error? I'm using PostgreSQL as my db backend.

Googling turns up very few reports of this error, and no solutions specific to Django. I've read that adding hold=True to the cursor call may fix it, but it's unclear how to set that through Django's ORM.
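
For what it's worth, a WITH HOLD cursor (the hold=True idea above) is a psycopg2/PostgreSQL feature rather than something the ORM exposes, so using it seems to mean dropping down to the raw driver connection. A rough sketch of where that flag would live, assuming the backend driver is psycopg2; the cursor name and table name here are made up:

from django.db import connection

connection.ensure_connection()             # make sure Django has opened the DB connection
raw = connection.connection                # the underlying psycopg2 connection
# A named (server-side) cursor declared WITH HOLD survives commits made
# elsewhere on this connection.
cur = raw.cursor(name='big_delete_scan', withhold=True)
cur.itersize = 2000                        # rows fetched from the server per round trip
cur.execute('SELECT id FROM myapp_mymodel WHERE some_criteria = %s', [123])  # table name is illustrative
for (pk,) in cur:
    # ...collect pks and delete them in batches here, e.g. issue
    # MyModel.objects.filter(pk__in=batch).delete() every few thousand rows...
    pass
cur.close()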

I've attempted to add a try/except to catch the error and continue the query, but that hasn't worked. Since the traceback doesn't include any of my own code, I'm not even sure which line is triggering the exception.

Upvotes: 0

Views: 1361

Answers (1)

4140tm

Reputation: 2088

This error occurs after 2000 records because that is the default chunk_size for iterator(). Once those records have been deleted, the iterator no longer knows where to continue from.
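
For reference, that default corresponds to the chunk_size argument of iterator(); passing a larger value would presumably only push the failure point further out rather than remove it:

# Equivalent to the default; a bigger chunk_size only changes how many rows
# each fetch pulls, so the same failure just shows up later.
for record in MyModel.objects.filter(some_criteria=123).iterator(chunk_size=2000):
    record.delete()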

I would use pagination for this task.

from django.core.paginator import Paginator
from django.db.models import Subquery

# An explicit ordering keeps the page boundaries stable while rows are deleted.
qs = MyModel.objects.filter(some_criteria=123).order_by('pk')
paginator = Paginator(qs, 10000)

# Walk the pages from last to first so the deletes don't shift the pages
# that haven't been visited yet.
for page_num in reversed(paginator.page_range):
    # delete() isn't allowed on a sliced queryset, so filter by the page's pks.
    MyModel.objects.filter(
        pk__in=Subquery(paginator.page(page_num).object_list.values('pk'))
    ).delete()

Here we go over the pages in reverse order: deleting from the front would shift the remaining rows into earlier pages and we'd skip them, which is essentially the same problem you hit with iterator().

We can't call delete() directly on the object_list because that isn't allowed on a sliced queryset, so we take the pks of the objects on the page and filter against them before deleting. Wrapping the pk lookup in Subquery saves us one additional query per iteration.
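
For comparison, a version without Subquery would have to materialize each page's primary keys first, which is the extra query per iteration mentioned above (same model and field names as before):

for page_num in reversed(paginator.page_range):
    # Wrapping the page's pks in list() forces one extra SELECT per page
    # before the DELETE runs.
    pks = list(paginator.page(page_num).object_list.values_list('pk', flat=True))
    MyModel.objects.filter(pk__in=pks).delete()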

Upvotes: 2
