Django Postgres migration: Fastest way to backfill a column in a table with 100 Million rows

I have a table Thing in Postgres that has 100 million rows.

I have a column, populated over time, that stores some keys; the keys were prefixed before being stored. Let's call it prefixed_key.

My task is to use the values of this column to populate another column with the same values but with the prefixes trimmed off. Let's call it simple_key.

I tried the following migration:

from django.db import migrations
import time


def backfill_simple_keys(apps, schema_editor):
    Thing = apps.get_model('thing', 'Thing')

    batch_size = 100000
    number_of_batches_completed = 0
    while Thing.objects.filter(simple_key__isnull=True).exists():
        things = Thing.objects.filter(simple_key__isnull=True)[:batch_size]
        for tng in things:
            prefixed_key = tng.prefixed_key
            if prefixed_key.startswith("prefix_A"):
                simple_key = prefixed_key[len("prefix_A"):]
            elif prefixed_key.startswith("prefix_BBB"):
                simple_key = prefixed_key[len("prefix_BBB"):]
            tng.simple_key = simple_key
        Thing.objects.bulk_update(
            things,
            ['simple_key'],
            batch_size=batch_size
        )
        number_of_batches_completed += 1
        print("Number of batches updated: ", number_of_batches_completed)
        sleep_seconds = 3
        time.sleep(sleep_seconds)

class Migration(migrations.Migration):

    dependencies = [
        ('thing', '0030_add_index_to_simple_key'),
    ]

    operations = [
        migrations.RunPython(
            backfill_simple_keys,
        ),
    ]

Each batch took about ~7 minutes to complete, which means it would take days to finish. It also increased the latency of the DB, which is being used in production.

Upvotes: 0

Views: 2152

Answers (1)

Ionut Ticus

Reputation: 2789

Since you're going to go through every record in that table anyway, it makes sense to traverse it in one go using a server-side cursor.
Calling
Thing.objects.filter(simple_key__isnull=True)[:batch_size]
is going to be expensive, especially as the index grows.
Also, the call above retrieves ALL fields from the table even though you are only going to use 2-3 of them.
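If you do stay in the ORM, you can at least avoid pulling every column: values_list() combined with iterator() streams only the needed fields through a server-side cursor on PostgreSQL (Django 2.0+). A rough, untested sketch:

# Stream only the two columns we need instead of full model instances.
rows = (
    Thing.objects
    .filter(simple_key__isnull=True)
    .values_list("id", "prefixed_key")
    .iterator(chunk_size=5000)
)
for pk, prefixed_key in rows:
    ...  # compute the simple key and collect (pk, simple_key) pairs for a bulk update

The raw psycopg2 version below goes further and batches the updates with execute_values():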

update_query = """UPDATE table SET simple_key = data.key 
    FROM (VALUES %s) AS data (id, key) WHERE table.id = data.id"""
conn = psycopg2.connect(DSN, cursor_factory=RealDictCursor)
cursor = conn.cursor(name="key_server_side_crs")  # having a name makes it a SSC
update_cursor = conn.cursor()  # regular cursor
cursor.itersize = 5000  # how many records to retrieve at a time
cursor.execute("SELECT id, prefixed_key, simple_key FROM table")
count = 0
batch = []
for row in cursor:
    if not row["simple_key"]:
        simple_key = calculate_simple_key(row["prefixed_key"])
        batch.append[(row["id"], simple_key)]
    if len(batch) >= 1000  # how many records to update at once
        execute_values(update_cursor, update_query, batch, page_size=1000)
        batch = []
        time.sleep(0.1)  # allow the DB to "breathe"
    count += 1
    if count % 100000 == 0:  # print progress every 100K rows
        print("processed %d rows", count)

The above is NOT tested, so it's advisable to create a copy of a few million rows of the table and test it against that first.
You can also experiment with various batch sizes (both for retrieval and update).
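The calculate_simple_key() helper isn't defined above; based on the two prefixes in the question, a minimal version could look like this (the fallback for keys with no known prefix is an assumption, adjust to your data):

PREFIXES = ("prefix_A", "prefix_BBB")  # the prefixes from the question

def calculate_simple_key(prefixed_key):
    """Strip the known prefix from a key."""
    for prefix in PREFIXES:
        if prefixed_key.startswith(prefix):
            return prefixed_key[len(prefix):]
    return prefixed_key  # assumed fallback: keep the value as-is if no prefix matches

As suggested above, run it against a copied subset of the table before touching the production data.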

Upvotes: 1
