Nathan

Reputation: 2970

pymongo taking over 24 hours to loop through 200K records

I have two collections in a db, page and pagearchive, that I am trying to clean up. I noticed that new documents were being created in pagearchive instead of values being added to embedded documents as intended. So essentially what this script does is go through every document in page, find all copies of that document in pagearchive, move the data I want into a single document, and delete the extras.

The problem is there are only 200K documents in pagearchive, and based on the count variable I am printing at the bottom, it's taking anywhere from 30 to 60+ minutes to iterate through 1000 records. This is extremely slow. The largest number of duplicate docs I have seen is 88, but for the most part when I query pagearchive on uu, I see 1-2 duplicate documents.

mongodb is on a single-instance 64-bit machine with 16GB of RAM. The uu key being iterated on in the pagearchive collection is a string. I made sure there was an index on that field with db.pagearchive.ensureIndex({uu:1}), and I also ran mongod --repair for good measure.

My guess is the problem is my sloppy Python code (I'm not very good at it), or perhaps something I am missing that is necessary for mongodb. Why is it going so slow, and what can I do to speed it up dramatically?

I thought maybe the uu field being a string was causing a bottleneck, but it's the unique property in the document (or will be once I clean up this collection). On top of that, when I stop the process and restart it, it speeds up to about 1000 records a second, until it starts finding duplicates in the collection again; then it goes dog slow (deleting about 100 records every 10-20 minutes).

from pymongo import Connection
import datetime


def match_dates(old, new):
    if old['coll_at'].month == new['coll_at'].month and old['coll_at'].day == new['coll_at'].day and old['coll_at'].year == new['coll_at'].year:
        return False

    return new

connection = Connection('dashboard.dev')


db = connection['mydb']

pageArchive = db['pagearchive']
pages = db['page']

count = 0
for page in pages.find(timeout=False):

    archive_keep = None
    ids_to_delete = []
    for archive in pageArchive.find({"uu" : page['uu']}):

        if archive_keep == None:
            #this is the first record we found, so we will store data from duplicate records with this one; delete the rest
            archive_keep = archive
        else:
            for attr in archive_keep.keys():
                #make sure we are dealing with an embedded document field
                if isinstance(archive_keep[attr], basestring) or attr == 'updated_at':
                    continue
                else:
                    try:
                        if len(archive_keep[attr]) == 0:
                            continue
                    except TypeError:
                        continue
                    try:
                        #We've got our first embedded doc from a property to compare against
                        for obj in archive_keep[attr]:
                            if archive['_id'] not in ids_to_delete:
                                ids_to_delete.append(archive['_id'])
                            #loop through secondary archive doc (comparing against the archive keep)
                            for attr_old in archive.keys():
                                #make sure we are dealing with an embedded document field
                                if isinstance(archive[attr_old], basestring) or attr_old == 'updated_at':
                                    continue
                                else:
                                    try:
                                        #now we know we're dealing with a list, make sure it has data
                                        if len(archive[attr_old]) == 0:
                                            continue
                                    except TypeError:
                                        continue
                                    if attr == attr_old:
                                        #document prop. match; loop through embedded document array and make sure data wasn't collected on the same day
                                        for obj2 in archive[attr_old]:
                                            new_obj = match_dates(obj, obj2)
                                            if new_obj != False:
                                                archive_keep[attr].append(new_obj)
                    except TypeError, te:
                        'not iterable'
        pageArchive.update({
                            '_id':archive_keep['_id']}, 
                           {"$set": archive_keep}, 
                           upsert=False)
        for mongoId in ids_to_delete:
            pageArchive.remove({'_id':mongoId})
        count += 1
        if count % 100 == 0:
            print str(datetime.datetime.now()) + ' ### ' + str(count) 

Upvotes: 3

Views: 1075

Answers (1)

reclosedev

Reputation: 9502

I'd make the following changes to the code:

  • in match_dates, return None instead of False and test with if new_obj is not None: — this checks identity without calling the object's __ne__ or __nonzero__.

  • for page in pages.find(timeout=False): if only the uu key is used and the pages are big, passing fields=['uu'] to find should speed up the queries.

  • change archive_keep == None to archive_keep is None

  • archive_keep[attr] is looked up 4 times. It will be a little faster to save keep_obj = archive_keep[attr] and then use keep_obj.

  • change ids_to_delete = [] to ids_to_delete = set(). Then if archive['_id'] not in ids_to_delete: will be O(1) instead of O(n).
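To illustrate, the first, third, and fifth suggestions might be combined like this (a sketch only, adapted from the question's match_dates and loop; the sample coll_at values are made up for demonstration, and comparing .date() is equivalent to the original year/month/day checks):

```python
import datetime


def match_dates(old, new):
    # Return None (not False) when both records were collected on the same
    # calendar day, so the caller can test identity with "is not None".
    if old['coll_at'].date() == new['coll_at'].date():
        return None
    return new


# A set makes the "not in ids_to_delete" membership test O(1) instead of O(n).
ids_to_delete = set()

# Hypothetical sample records standing in for pagearchive documents.
same_day = {'coll_at': datetime.datetime(2012, 5, 1, 9, 30)}
same_day_later = {'coll_at': datetime.datetime(2012, 5, 1, 17, 0)}
next_day = {'coll_at': datetime.datetime(2012, 5, 2, 8, 0)}

assert match_dates(same_day, same_day_later) is None   # same day: skip
assert match_dates(same_day, next_day) is next_day     # different day: keep

ids_to_delete.add('some_object_id')
assert 'some_object_id' in ids_to_delete               # constant-time lookup
```

The identity test (is None / is not None) avoids invoking any comparison methods on the returned document, and the set keeps the duplicate-ID check cheap even when a uu value has many duplicates.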

Upvotes: 3
