Reputation: 15
I have a table Thing in Postgres that has 100 million rows.
I have a column that was populated over time and that stores some keys. The keys were prefixed before storing. Let's call it prefixed_keys.
My task is to use the values of this column to populate another column with the same values, but with the prefixes trimmed off. Let's call it simple_keys.
I tried the following migration:
from django.db import migrations
import time


def backfill_simple_keys(apps, schema_editor):
    Thing = apps.get_model('thing', 'Thing')
    batch_size = 100000
    number_of_batches_completed = 0
    while Thing.objects.filter(simple_key__isnull=True).exists():
        things = Thing.objects.filter(simple_key__isnull=True)[:batch_size]
        for tng in things:
            prefixed_key = tng.prefixed_key
            if prefixed_key.startswith("prefix_A"):
                simple_key = prefixed_key[len("prefix_A"):]
            elif prefixed_key.startswith("prefix_BBB"):
                simple_key = prefixed_key[len("prefix_BBB"):]
            tng.simple_key = simple_key
        Thing.objects.bulk_update(
            things,
            ['simple_key'],
            batch_size=batch_size
        )
        number_of_batches_completed += 1
        print("Number of batches updated: ", number_of_batches_completed)
        sleep_seconds = 3
        time.sleep(sleep_seconds)


class Migration(migrations.Migration):

    dependencies = [
        ('thing', '0030_add_index_to_simple_key'),
    ]

    operations = [
        migrations.RunPython(
            backfill_simple_keys,
        ),
    ]
Each batch took about 7 minutes to complete, which means the whole backfill would take days! It also increased the latency of the DB, which is being used in production.
Upvotes: 0
Views: 2152
Reputation: 2789
Since you're going to go through every record in that table anyway, it makes sense to traverse it in one go using a server-side cursor.
Calling Thing.objects.filter(simple_key__isnull=True)[:batch_size] over and over is going to be expensive, especially as the index starts to grow.
Also, the call above retrieves ALL fields from the table even though you only need 2-3 of them.
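As a side note, even if you stayed on the ORM you could at least restrict the fetched columns; a rough, untested sketch using the model and field names from the question:

# Hypothetical ORM-only tweak: load just the columns the backfill needs.
# .only() defers every other field on Thing, so far less data crosses the wire.
things = (
    Thing.objects
    .filter(simple_key__isnull=True)
    .only('id', 'prefixed_key')[:batch_size]
)

With a server-side cursor, though, the whole table can be processed in a single pass, roughly like this: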
update_query = """UPDATE table SET simple_key = data.key
FROM (VALUES %s) AS data (id, key) WHERE table.id = data.id"""
conn = psycopg2.connect(DSN, cursor_factory=RealDictCursor)
cursor = conn.cursor(name="key_server_side_crs") # having a name makes it a SSC
update_cursor = conn.cursor() # regular cursor
cursor.itersize = 5000 # how many records to retrieve at a time
cursor.execute("SELECT id, prefixed_key, simple_key FROM table")
count = 0
batch = []
for row in cursor:
if not row["simple_key"]:
simple_key = calculate_simple_key(row["prefixed_key"])
batch.append[(row["id"], simple_key)]
if len(batch) >= 1000 # how many records to update at once
execute_values(update_cursor, update_query, batch, page_size=1000)
batch = []
time.sleep(0.1) # allow the DB to "breathe"
count += 1
if count % 100000 == 0: # print progress every 100K rows
print("processed %d rows", count)
The above is NOT tested, so it's advisable to create a copy of a few million rows of the table and test against that first.
You can also test various batch size settings (both for retrieve and update).
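Setting up that test copy can be done from the same connection; a rough sketch (table name and row count are placeholders, same as above):

# Hypothetical setup for a dry run: copy a slice of the real table and point
# the SELECT/UPDATE statements above at the copy instead.
setup_cursor = conn.cursor()
setup_cursor.execute("CREATE TABLE table_copy AS SELECT * FROM table LIMIT 5000000")
conn.commit()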
Upvotes: 1