Django Postgres migration: Fastest way to backfill a column in a table with 100 Million rows

I have a table Thing in Postgres that has 100 million rows.

I have a column, populated over time, that stores some keys; the keys were prefixed before being stored. Let's call it prefixed_key.

My task is to use the values of this column to populate another column with the same values but with the prefixes trimmed off. Let's call it simple_key.

I tried the following migration:

from django.db import migrations
import time


def backfill_simple_keys(apps, schema_editor):
    Thing = apps.get_model('thing', 'Thing')

    batch_size = 100000
    number_of_batches_completed = 0
    while Thing.objects.filter(simple_key__isnull=True).exists():
        things = Thing.objects.filter(simple_key__isnull=True)[:batch_size]
        for tng in things:
            prefixed_key = tng.prefixed_key
            if prefixed_key.startswith("prefix_A"):
                simple_key = prefixed_key[len("prefix_A"):]
            elif prefixed_key.startswith("prefix_BBB"):
                simple_key = prefixed_key[len("prefix_BBB"):]
            tng.simple_key = simple_key
        Thing.objects.bulk_update(
            things,
            ['simple_key'],
            batch_size=batch_size
        )
        number_of_batches_completed += 1
        print("Number of batches updated: ", number_of_batches_completed)
        sleep_seconds = 3
        time.sleep(sleep_seconds)

class Migration(migrations.Migration):

    dependencies = [
        ('thing', '0030_add_index_to_simple_key'),
    ]

    operations = [
        migrations.RunPython(
            backfill_simple_keys,
        ),
    ]

Each batch took about ~7 minutes to complete, which means it would take days to finish. It also increased the latency of the DB, which is being used in production.

Upvotes: 0

Views: 2152

Answers (1)

Ionut Ticus

Reputation: 2789

Since you're going to go through every record in that table anyway, it makes sense to traverse it in one go using a server-side cursor.
Calling
Thing.objects.filter(simple_key__isnull=True)[:batch_size]
is going to be expensive, especially as the index grows.
Also, the call above retrieves ALL fields from the table even though you are only going to use 2-3 of them.
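If you do stay in the ORM, you can at least avoid pulling every column: values_list() combined with iterator() streams only the needed fields through a server-side cursor on PostgreSQL (Django 2.0+). A rough, untested sketch:

# Stream only the two columns we need instead of full model instances.
rows = (
    Thing.objects
    .filter(simple_key__isnull=True)
    .values_list("id", "prefixed_key")
    .iterator(chunk_size=5000)
)
for pk, prefixed_key in rows:
    ...  # compute the simple key and collect (pk, simple_key) pairs for a bulk update

The raw psycopg2 version below goes further and batches the updates with execute_values():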

update_query = """UPDATE table SET simple_key = data.key 
    FROM (VALUES %s) AS data (id, key) WHERE table.id = data.id"""
conn = psycopg2.connect(DSN, cursor_factory=RealDictCursor)
cursor = conn.cursor(name="key_server_side_crs")  # having a name makes it a SSC
update_cursor = conn.cursor()  # regular cursor
cursor.itersize = 5000  # how many records to retrieve at a time
cursor.execute("SELECT id, prefixed_key, simple_key FROM table")
count = 0
batch = []
for row in cursor:
    if not row["simple_key"]:
        simple_key = calculate_simple_key(row["prefixed_key"])
        batch.append[(row["id"], simple_key)]
    if len(batch) >= 1000  # how many records to update at once
        execute_values(update_cursor, update_query, batch, page_size=1000)
        batch = []
        time.sleep(0.1)  # allow the DB to "breathe"
    count += 1
    if count % 100000 == 0:  # print progress every 100K rows
        print("processed %d rows", count)

The above is NOT tested, so it's advisable to create a copy of a few million rows of the table and test it against that first.
You can also experiment with various batch sizes (both for retrieval and update).
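The calculate_simple_key() helper isn't defined above; based on the two prefixes in the question, a minimal version could look like this (the fallback for keys with no known prefix is an assumption, adjust to your data):

PREFIXES = ("prefix_A", "prefix_BBB")  # the prefixes from the question

def calculate_simple_key(prefixed_key):
    """Strip the known prefix from a key."""
    for prefix in PREFIXES:
        if prefixed_key.startswith(prefix):
            return prefixed_key[len(prefix):]
    return prefixed_key  # assumed fallback: keep the value as-is if no prefix matches

As suggested above, run it against a copied subset of the table before touching the production data.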

Upvotes: 1
