Rizhiy

Reputation: 854

Efficiently extract large amounts of data from Django ORM

I have a Django setup with PostgreSQL DB. One of the tables contains large amounts of data (>1e9 rows) and I need to efficiently iterate over a large subset of it.

Currently, when I try to select a large amount of data, it starts buffering the result and my computer runs out of memory. If I use .iterator() on the QuerySet, it just hangs. If I try to use raw SQL and fetchall(), it also starts buffering.

I believe Django uses psycopg2 for PostgreSQL, which has a cursor.itersize parameter, but when I try to use it with a cursor in Django it doesn't do anything.

I know that the problem is not on the database end, since I can execute the query using psql (with -A --variable="FETCH_COUNT=10000") and it starts loading immediately without using any memory.

Extra info:

EDIT: Using a psycopg2 server-side cursor seems to work, but it is slower and ugly: How can I use server-side cursors with django and psycopg2?
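For reference, this is roughly what that workaround looks like. It is only a sketch: it assumes psycopg2, reaches the raw DB-API connection through Django's connection.connection, and the helper and cursor names are placeholders.

from django.db import connection, transaction

def fetch_with_named_cursor(sql, itersize=10_000):
    # Placeholder helper: a named psycopg2 cursor is a server-side cursor;
    # itersize controls how many rows are pulled per network round trip.
    with transaction.atomic():
        # Make sure the underlying psycopg2 connection actually exists.
        connection.ensure_connection()
        with connection.connection.cursor(name="my_named_cursor") as cursor:
            cursor.itersize = itersize
            cursor.execute(sql)
            yield from cursor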

EDIT 2: Here is the code which is working for me now, but which is quite ugly:

from django.db import connection, transaction


def get_stuff():
    def fetch_from_server_cursor(cursor, cursor_name, fetch_size=10_000):
        # Keep fetching batches from the server-side cursor until it is exhausted.
        while True:
            cursor.execute(f"FETCH {fetch_size} FROM {cursor_name}")
            chunk = cursor.fetchall()
            if not chunk:
                return
            yield from chunk

    # The cursor has to stay inside a transaction, otherwise it is gone as soon
    # as Django auto-commits the DECLARE statement.
    with transaction.atomic(), connection.cursor() as cursor:
        cursor_name = "my_cursor"
        cursor.execute(
            f"""
            DECLARE {cursor_name} CURSOR FOR
            SELECT first_column, second_column
            FROM {MyModel.objects.model._meta.db_table}
            """
        )
        yield from fetch_from_server_cursor(cursor, cursor_name)
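Calling it is then just a matter of iterating the generator, which yields plain (first_column, second_column) tuples rather than model instances:

for first_column, second_column in get_stuff():
    ...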

EDIT 3: Here is the Django model. N.B. I'm using Timescale on the table in the DB, which automatically creates an index on the TimeScaleDateTimeField:

class MyModel(models.Model):
    first_column = models.IntegerField()
    second_column = models.TimeScaleDateTimeField()
    third_column = models.URLField(null=True, blank=True)
    ...

    class Meta:
        ordering = ("second_column",)

Upvotes: 2

Views: 3284

Answers (2)

willeM_ Van Onsem

Reputation: 477891

The table has [more than] 10 columns, but I only need 2 of them, so if it is possible to only fetch selected for faster loading it would be nice.

You can do this with .only(…) [Django-doc] to select only a subset of the columns, for example:

for item in MyModel.objects.only('pk', 'other_column').iterator():
    print((item.pk, item.other_column))

This will reduce some bandwidth between the database and the application layer. But regardless, 10⁹ items is typically not really feasible in Python. If we, for example, simply sum up the items of such a range (sum(range(1000000000))), it will take ≈ 14 seconds, and that is a very simple generator. Django will, for each record, read the content from the database, create a model object, and set the fields accordingly, so this can easily take minutes, or even hours.
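If you do not need model instances at all, a variant of the same idea (just a sketch, using the two columns from the question) is to combine .values_list(…) with .iterator(…), which yields plain tuples and skips constructing a model object per row:

# Yields (first_column, second_column) tuples instead of model instances,
# avoiding the per-row object construction overhead.
for first, second in MyModel.objects.values_list(
    'first_column', 'second_column'
).iterator(chunk_size=10_000):
    ...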

Upvotes: 3

greg

Reputation: 1121

This is caused by Django adding WITH HOLD to its DECLARE CURSOR statement. By default a server-side cursor's results are unavailable at the end of a transaction, but Django wants them to be available to subsequent transactions, because otherwise its default auto-commit mode would look silly, so it adds WITH HOLD. This turns the cursor from a lightweight construct, skipping and bookmarking its way through a one-way stream of results, into a fully-realised copy of the results saved to a temporary area on disk for who knows how long.

To support WITH HOLD, PostgreSQL materialises all the results when Django auto-commits the DECLARE CURSOR transaction, i.e. before being asked to return any initial rows. For large results, this can look like it's hanging at the DECLARE CURSOR phase: it's not, but for billions of rows it can take a long time (and tie up server resources) depending on the speed of the server disks, configuration, etc.

The point of using an iterator, and so a server-side cursor, is to read a large result set without Django or the database driver trying to cache it all in memory. To instead persist everything to the server disk and then read it back again is a huge waste of time and server resources, usually on huge queries where it's least wanted or needed, and often going unnoticed.

The fix is to disable Django's auto-commit mode, which prevents it from adding the WITH HOLD, e.g. instead of

for obj in MyModel.objects.all().iterator(chunk_size=10000):
    ...

use:

from django.db import transaction

with transaction.atomic():  # don't auto-commit, so Django skips WITH HOLD and the server doesn't materialise all the cursor results up front
    for obj in MyModel.objects.all().iterator(chunk_size=10000):
        ...

If your query is ordered or grouped then it could still have an initial delay while the database sorts it before returning the first results (unless it uses an index to do the sorting/grouping).
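For example, since the model in the question has a default Meta ordering, clearing it with an empty .order_by() avoids forcing a sort when the row order doesn't matter (a sketch, reusing the two columns from the question):

from django.db import transaction

with transaction.atomic():
    # .order_by() with no arguments clears the model's default Meta ordering,
    # so PostgreSQL can stream rows without sorting them first.
    queryset = MyModel.objects.order_by().only('first_column', 'second_column')
    for obj in queryset.iterator(chunk_size=10_000):
        ...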

The cursor should now be quick to declare, fetching should start almost immediately, cancelling it part-way through is quick and wastes nothing, and your server will be spared a lot of wasted resources.

Upvotes: 1
