Leopd

Reputation: 42757

Efficient data migration on a large django table

I need to add a new column to a large (5m row) Django table. I have a South schemamigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with South migrations, just ignore the orm. prefix on the model name.)

print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():            
    if cnt % 500 == 0:
        print "    %s done so far" % cnt
    # article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1

I switched from objects.all to objects.iterator to reduce memory requirements. But something is still chewing up vast amounts of memory when I run this script. Even with the actually useful line commented out as above, the script grows to using 10+ GB of RAM before getting very far through the table, at which point I give up on it.

Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?

FWIW, I'm using Python 2.6, Django 1.2.1, South 0.7.2, and MySQL 5.1.

Upvotes: 9

Views: 6677

Answers (5)

Steve K

Reputation: 11369

Ensure settings.DEBUG is set to False. With DEBUG=True, Django stores every query sent to the RDBMS in memory (in connection.queries), which eats memory rapidly during database-intensive operations like this.

Now that Django 1.8 is out this matters less, since the query log is capped at a hardcoded maximum of 9000 entries instead of growing without bound as it did before.
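
If you do need to run with DEBUG=True for some reason, a rough sketch (reusing the question's loop and its calculate_newfield) is to clear the query log by hand every so often, since connection.queries is the list that keeps growing:

from django.db import connection, reset_queries

for cnt, article in enumerate(orm.Article.objects.iterator()):
    article.newfield = calculate_newfield(article)
    article.save()
    if cnt % 500 == 0:
        print "    %s done, %s queries logged" % (cnt, len(connection.queries))
        reset_queries()  # drop the accumulated query log to free memory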

Upvotes: 7

John Mee

Reputation: 52253

orm.Article.objects.iterator()

Does that run the whole query and save the result in memory? Or fetch rows from the database one at a time?

I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:

e.g. http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany

db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...)
new, old = db.cursor(), db.cursor()
old.execute("""
    SELECT  *
    FROM    whatever
""")
for row in old.fetchmany(size=500):
    (col1, col2, col3, ...) = row
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3, ...)
        VALUES (
            %s, %s, %s, ...)
        """, (col1, col2, col3, ...))
new.close()
old.close()

It will be slow. I pulled this from a standalone migration script of mine so ymmv.

fetchmany is standard (PEP 249). I've not done exactly what you're looking for, so there's still a little work to do from this sample: I've not looped over the loop to pull sets of 500 until done, so you'll need to work that out for yourself.
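
Something along these lines might do it, using Django's own database connection; the table and column names here are placeholders, not the question's actual schema:

from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT id, body FROM myapp_article")
while True:
    rows = cursor.fetchmany(500)
    if not rows:
        break
    for article_id, body in rows:
        pass  # compute the new value here and write it back with a second cursor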

Upvotes: 2

John Mee

Reputation: 52253

Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?

a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups

min_id = 0
while min_id < rowcount:
    max_id = min_id + 500
    articles = Article.objects.raw(
        'SELECT * FROM article WHERE id > %s AND id <= %s', [min_id, max_id])
    for old_article in articles:
        # create the new article from old_article here
        old_article.save()
    min_id = max_id

Upvotes: 2

Chris Pratt

Reputation: 239290

If you don't need full access to the objects, you can always use a combo of only and values or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
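
For example, a sketch of that idea, assuming the new value can be computed from one existing field; the body column and the helper function here are hypothetical, not from the question:

for pk, body in orm.Article.objects.values_list('pk', 'body').iterator():
    # update() writes the value back without instantiating a model object
    orm.Article.objects.filter(pk=pk).update(
        newfield=calculate_newfield_from_body(body))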

Upvotes: 1

S.Lott

Reputation: 391852

Welcome to Django's ORM. I think this is an inherent problem.

I've also had problems with large databases, dumpdata, loaddata and the like.

You have two choices.

  1. Stop trying to use South and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time, then switch the database definitions and restart Django. (A sketch of this approach follows the list.)

  2. Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.
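
A sketch of option 1, assuming settings.DATABASES defines both 'default' (the new database) and 'old' (the existing one), and reusing calculate_newfield from the question; the app and model names are only illustrative:

from myapp.models import Article

def migrate_articles():
    # read each row from the old database and write it to the new one
    for old_article in Article.objects.using('old').iterator():
        old_article.newfield = calculate_newfield(old_article)
        old_article.save(using='default')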

It's not that South or the ORM is particularly bad. But for bulk processing in large databases, they cache too much in memory.

Upvotes: 2
