Reputation: 2970
I have two collections in a db, `page` and `pagearchive`, that I am trying to clean up. I noticed that new documents were being created in `pagearchive` instead of values being added to embedded documents as intended. So essentially what this script does is go through every document in `page`, find all copies of that document in `pagearchive`, merge the data I want into a single document, and delete the extras.

The problem is that there are only 200K documents in `pagearchive`, and based on the count variable I am printing at the bottom, it's taking anywhere from 30 to 60+ minutes to iterate through 1000 records. This is extremely slow. The largest number of duplicate docs I have seen is 88, but for the most part when I query `pagearchive` on `uu`, I see 1-2 duplicate documents.
MongoDB is running on a single-instance, 64-bit machine with 16GB of RAM.

The `uu` key being iterated on in the `pagearchive` collection is a string. I made sure there is an index on that field with `db.pagearchive.ensureIndex({uu:1})`, and I also ran `mongod --repair` for good measure.
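To double-check that queries on `uu` actually use that index (rather than scanning the collection), something like this sketch with pymongo's `explain()` should show an index plan; the `uu` value below is made up:

```python
from pymongo import Connection

connection = Connection('dashboard.dev')
db = connection['mydb']
pageArchive = db['pagearchive']

# Sketch: ask the server how it would run a query on uu.
plan = pageArchive.find({"uu": "some-example-uu"}).explain()
# On this MongoDB/pymongo generation the plan typically contains keys such as
# 'cursor' (e.g. 'BtreeCursor uu_1' when the index is used), 'nscanned' and 'millis'.
print plan.get('cursor'), plan.get('nscanned'), plan.get('millis')
```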
My guess is the problem is with my sloppy Python code (I'm not very good at it), or perhaps something I am missing that is necessary for MongoDB. Why is it going so slow, and what can I do to speed it up dramatically?

I thought maybe the `uu` field being a string was causing a bottleneck, but that's the unique property in the document (or will be once I clean up this collection). On top of that, when I stop the process and restart it, it speeds up to about 1000 records a second, until it starts finding duplicates in the collection again; then it goes dog slow again (deleting about 100 records every 10-20 minutes).
```python
from pymongo import Connection
import datetime


def match_dates(old, new):
    if old['coll_at'].month == new['coll_at'].month and old['coll_at'].day == new['coll_at'].day and old['coll_at'].year == new['coll_at'].year:
        return False
    return new

connection = Connection('dashboard.dev')
db = connection['mydb']
pageArchive = db['pagearchive']
pages = db['page']

count = 0
for page in pages.find(timeout=False):
    archive_keep = None
    ids_to_delete = []
    for archive in pageArchive.find({"uu" : page['uu']}):
        if archive_keep == None:
            #this is the first record we found, so we will store data from duplicate records with this one; delete the rest
            archive_keep = archive
        else:
            for attr in archive_keep.keys():
                #make sure we are dealing with an embedded document field
                if isinstance(archive_keep[attr], basestring) or attr == 'updated_at':
                    continue
                else:
                    try:
                        if len(archive_keep[attr]) == 0:
                            continue
                    except TypeError:
                        continue
                    try:
                        #We've got our first embedded doc from a property to compare against
                        for obj in archive_keep[attr]:
                            if archive['_id'] not in ids_to_delete:
                                ids_to_delete.append(archive['_id'])
                            #loop through secondary archive doc (comparing against the archive keep)
                            for attr_old in archive.keys():
                                #make sure we are dealing with an embedded document field
                                if isinstance(archive[attr_old], basestring) or attr_old == 'updated_at':
                                    continue
                                else:
                                    try:
                                        #now we know we're dealing with a list, make sure it has data
                                        if len(archive[attr_old]) == 0:
                                            continue
                                    except TypeError:
                                        continue
                                    if attr == attr_old:
                                        #document prop. match; loop through embedded document array and make sure data wasn't collected on the same day
                                        for obj2 in archive[attr_old]:
                                            new_obj = match_dates(obj, obj2)
                                            if new_obj != False:
                                                archive_keep[attr].append(new_obj)
                    except TypeError, te:
                        'not iterable'

    pageArchive.update({
        '_id':archive_keep['_id']},
        {"$set": archive_keep},
        upsert=False)
    for mongoId in ids_to_delete:
        pageArchive.remove({'_id':mongoId})
    count += 1
    if count % 100 == 0:
        print str(datetime.datetime.now()) + ' ### ' + str(count)
```
Upvotes: 3
Views: 1075
Reputation: 9502
I'd make the following changes to the code:
In `match_dates`, return `None` instead of `False` and do `if new_obj is not None:`; that checks the reference without calling the object's `__ne__` or `__nonzero__`.
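A minimal sketch of that change, reusing the field names from the question's code:

```python
def match_dates(old, new):
    # Same collection day -> duplicate data point; signal it with None instead of False.
    if (old['coll_at'].month == new['coll_at'].month
            and old['coll_at'].day == new['coll_at'].day
            and old['coll_at'].year == new['coll_at'].year):
        return None
    return new

# At the call site (obj, obj2, archive_keep and attr as in the question's loop):
new_obj = match_dates(obj, obj2)
if new_obj is not None:   # identity check, no __ne__/__nonzero__ call
    archive_keep[attr].append(new_obj)
```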
In `for page in pages.find(timeout=False):`, if only the `uu` key is used and the page documents are big, passing `fields=['uu']` to `find` should speed up the queries.
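For example (`fields` is the parameter name in older pymongo; newer drivers spell it `projection`):

```python
# Only uu (plus _id) is needed from the page collection, so don't pull the
# whole document over the wire for every page.
for page in pages.find(timeout=False, fields=['uu']):
    uu = page['uu']
    # ... rest of the loop unchanged ...
```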
Change `archive_keep == None` to `archive_keep is None`.
`archive_keep[attr]` is looked up 4 times. It will be a little faster to save `keep_obj = archive_keep[attr]` and then use `keep_obj`.
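Something like this inside the attribute loop:

```python
for attr in archive_keep.keys():
    keep_obj = archive_keep[attr]   # one dict lookup instead of four
    if isinstance(keep_obj, basestring) or attr == 'updated_at':
        continue
    try:
        if len(keep_obj) == 0:
            continue
    except TypeError:
        continue
    for obj in keep_obj:
        # ... same comparison/merge logic as before, appending to keep_obj ...
        pass
```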
Change `ids_to_delete = []` to `ids_to_delete = set()`. Then `if archive['_id'] not in ids_to_delete:` will be O(1).
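A sketch of that change against the question's loop (`archive` and `pageArchive` as in the question):

```python
ids_to_delete = set()
# ...
if archive['_id'] not in ids_to_delete:   # O(1) on a set instead of O(n) on a list
    ids_to_delete.add(archive['_id'])
# ...
for mongoId in ids_to_delete:
    pageArchive.remove({'_id': mongoId})
```

Since adding an element that is already in a set is a no-op, the membership check could even be dropped and `add` called unconditionally.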
Upvotes: 3