Reputation: 8333
I am trying to push some big files (around 4 million records) into a mongo instance. What I am basically trying to achieve is to update the existing data with the data from the files. The algorithm would look something like:
import time

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')

for row in dataFile:
    # Build a dict for the current tab-separated line
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))

    # Look up the existing document for this order
    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        # Only bump the timestamp when the weight actually changed
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
    else:
        row['tsUpdated'] = time.time()

    # Replace the document, or insert it if it does not exist yet
    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)
So the algorithm is: update the whole row except 'tsUpdated' if the weights are the same, insert a new row if it is not in mongo yet, or update the whole row including 'tsUpdated' otherwise.
The question is: can this be done faster, more easily and more efficiently from mongo's point of view (possibly with some kind of bulk insert)?
Upvotes: 4
Views: 4930
Reputation: 400
Combine a unique index on orderId with an update query that also checks for a change in itemWeight. The unique index prevents an insert with only a modified timestamp if the orderId is already present and itemWeight is the same.
mongoCollection.ensure_index('orderId', unique=True)

mongoCollection.update({'orderId': row['orderId'],
                        'itemWeight': {'$ne': row['itemWeight']}},
                       row, upsert=True)
My benchmark shows a 5-10x performance improvement over your algorithm (depending on the ratio of inserts to updates).
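If you also want the bulk write the question asks about, the same filter can be fed to PyMongo's bulk_write with ReplaceOne operations. This is only a minimal sketch under a few assumptions: PyMongo 3.x, a hypothetical mydb.orders collection, and that the duplicate-key errors raised for unchanged rows can simply be swallowed (they are just the unique index rejecting the no-op upserts).

import time
from pymongo import MongoClient, ReplaceOne
from pymongo.errors import BulkWriteError

client = MongoClient()
mongoCollection = client['mydb']['orders']      # hypothetical database/collection names
mongoCollection.create_index('orderId', unique=True)

def flush(batch):
    # ordered=False lets the remaining operations run even when individual
    # upserts are rejected by the unique index (orderId present, itemWeight unchanged).
    try:
        mongoCollection.bulk_write(batch, ordered=False)
    except BulkWriteError:
        pass

batch = []
for row in rows:                                # rows: dicts built from the data file
    row['tsUpdated'] = time.time()
    batch.append(ReplaceOne(
        {'orderId': row['orderId'],
         'itemWeight': {'$ne': row['itemWeight']}},
        row, upsert=True))
    if len(batch) >= 1000:                      # flush in chunks to bound memory
        flush(batch)
        batch = []
if batch:
    flush(batch)

Batching the operations avoids one round trip per record, which is usually where most of the time goes with 4 million rows; the $ne filter still makes sure unchanged rows are never rewritten.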
Upvotes: 6