Reputation: 64769
I'm trying to design a bulk data import task using Django's ORM on top of MySQL. Normally, I'd simply use LOAD DATA INFILE, but the data I'm bulk importing spans three tables, and some of the records may already exist, so I have to check for pre-existing records, create or retrieve their ID, and then use this ID when creating or retrieving the other records.
By default, the import rate is 0.8 rows/sec. Quite horrible. I managed to bump this up to 1.5 rows/sec by running DISABLE KEYS on the affected tables, but as I have a few million rows, this is still way too slow.
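For context, DISABLE KEYS can be issued straight from Django's raw database connection. A minimal sketch (the table name myapp_data is a placeholder, not from my actual schema):

```python
def keys_sql(table, action):
    # Build the MySQL statement that toggles non-unique index
    # maintenance; DISABLE defers index updates until keys are
    # re-enabled, which speeds up bulk inserts.
    assert action in ("ENABLE", "DISABLE")
    return "ALTER TABLE `%s` %s KEYS" % (table, action)

# With Django, run it on the raw connection around the import:
#   from django.db import connection
#   cursor = connection.cursor()
#   cursor.execute(keys_sql("myapp_data", "DISABLE"))
#   ... bulk import ...
#   cursor.execute(keys_sql("myapp_data", "ENABLE"))
```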
Is there any general advice for speeding up Django's ORM for bulk importing complex table relationships?
I was considering disabling Django's transaction management, in order to wrap the entire import in a single transaction. However, since the import takes so long, the import process periodically updates a status model to report percent completion. If I wrap the entire import in a single transaction, it won't be able to update this status record. So is there any way to disable transaction management for only a specific set of models, and still allow it to commit a separate model?
I'd like to do something like:
from django.db import transaction
from myapp.models import Status, Data

transaction.enter_transaction_management()
transaction.managed(True)

status = Status.objects.get(id=123)
try:
    data = magically_get_data_iter()
    for row in data:
        d, _ = Data.objects.get_or_create(**row)
        d.save()  # not actually visible in admin until the commit below
        if not row.i % 100:
            status.current_row = row.i
            status.total_rows = row.total
            # obviously doesn't work, but this should somehow
            # commit just the status record immediately
            status.save(commit=True)
finally:
    transaction.commit()
Upvotes: 3
Views: 1115
Reputation: 64769
I solved this by placing the bulk-updated model and the model storing the status record on different databases, and then disabling transaction management only on the former database.
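For anyone replicating this: the two aliases just need entries in the DATABASES setting. A sketch with hypothetical database names and credentials:

```python
# settings.py -- 'primary' holds the bulk-import tables, 'secondary'
# holds the status record; names and credentials are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'import_data',
        'USER': 'importer',
        'PASSWORD': 'secret',
    },
    'secondary': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'import_status',
        'USER': 'importer',
        'PASSWORD': 'secret',
    },
}
# Alias so the import code can refer to the bulk database as 'primary'.
DATABASES['primary'] = DATABASES['default']
```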
e.g. a simplification of my example above:
django.db.transaction.enter_transaction_management(using='primary')
django.db.transaction.managed(True, using='primary')
i = 0
for record in records:
    i += 1
    r = PrimaryDBModel(**record)
    r.save()  # This will not be committed until the end.
    if not i % 100:
        status = SecondaryDBModel(id=123)
        status.current_row = i
        status.save()  # This will be committed immediately.
django.db.transaction.commit(using='primary')
django.db.transaction.leave_transaction_management(using='primary')
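To avoid passing using= on every query, a database router can send the status model's reads and writes to the secondary database automatically. A sketch, reusing the hypothetical model names from the example above:

```python
class ImportRouter(object):
    # Route SecondaryDBModel (the status record) to 'secondary';
    # everything else goes to 'primary'.
    def db_for_read(self, model, **hints):
        if model.__name__ == 'SecondaryDBModel':
            return 'secondary'
        return 'primary'

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

# settings.py:
#   DATABASE_ROUTERS = ['myapp.routers.ImportRouter']
```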
Upvotes: 3