Optimization question in MapReduce in MongoDB

Question

My friend and I are trying to run a map reduce on a collection that has items added to it continuously.

Basically, we calculate the average of some fields and put the results in a collection (via map reduce).

Here is the issue: every time the map reduce is run, it goes through ALL of the documents. I'm new to map reduce, but based on what I know, it seems it would be far more efficient to run it only on the new and/or modified documents and merge those results into the existing output collection.

So I figured I'd just do it myself. I added a "processed: false" field to each document, and when the map reduce runs I pass in a query filter "{processed:false}". After the map reduce finishes, I set "{processed:true}" on all the items where processed = false.

Here is the problem: I am worried about an edge case. What happens if some items are added to the collection while the map reduce is running? They were never passed into the map reduce, yet after it finishes their processed flag is set to true.

What would be great is if, instead of passing a "query filter" into Mongo, I could pass in a set of documents. Then I could set the processed flag to true on exactly those objects and pass them in.

Remon van Vliet · Accepted Answer

Make it a 3-step thing. Have 3 states, say UNPROCESSED, MARKEDFORPROCESSING and PROCESSED, then:

1. db.col.update({processingState:UNPROCESSED}, {$set:{processingState:MARKEDFORPROCESSING}}, false, true)
2. Run m/r against MARKEDFORPROCESSING documents. These are guaranteed to have been there at m/r start.
3. db.col.update({processingState:MARKEDFORPROCESSING}, {$set:{processingState:PROCESSED}}, false, true)
4. Go to 1.

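This avoids your edge case: anything inserted while the m/r is running stays UNPROCESSED and is picked up on the next pass. Since MongoDB applies each document's update atomically, no document can be flagged PROCESSED without first having been MARKEDFORPROCESSING.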
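To make the flow concrete, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB: the array `col` stands in for the collection, and `moveState`, `processMarked` and `runPass` are hypothetical helpers that mimic the two multi-document updates and the m/r run. Keeping a running sum and count so the average can be updated incrementally is also an assumption, not something the answer specifies.

```javascript
const UNPROCESSED = "UNPROCESSED";
const MARKED = "MARKEDFORPROCESSING";
const PROCESSED = "PROCESSED";

// Hypothetical stand-in for a multi-document update: moves every document
// in state `from` to state `to` and returns how many were changed.
function moveState(col, from, to) {
  let changed = 0;
  for (const doc of col) {
    if (doc.processingState === from) {
      doc.processingState = to;
      changed++;
    }
  }
  return changed;
}

// Hypothetical stand-in for the m/r run: folds the marked documents into
// running totals. `duringRun` simulates inserts that arrive mid-run.
function processMarked(col, totals, duringRun) {
  const batch = col.filter((doc) => doc.processingState === MARKED);
  if (duringRun) duringRun();
  for (const doc of batch) {
    totals.sum += doc.value;
    totals.count += 1;
  }
}

// One full pass: mark, process, then flag only what was marked.
function runPass(col, totals, duringRun) {
  moveState(col, UNPROCESSED, MARKED);
  processMarked(col, totals, duringRun);
  moveState(col, MARKED, PROCESSED);
}

const col = [
  { value: 10, processingState: UNPROCESSED },
  { value: 20, processingState: UNPROCESSED },
];
const totals = { sum: 0, count: 0 };

// First pass: a document with value 60 is inserted while the run is in progress.
runPass(col, totals, () => col.push({ value: 60, processingState: UNPROCESSED }));
const lateDocState = col[2].processingState;    // still UNPROCESSED
const avgAfterFirst = totals.sum / totals.count; // (10 + 20) / 2

// Second pass picks up the late document.
runPass(col, totals);
const avgAfterSecond = totals.sum / totals.count; // (10 + 20 + 60) / 3

console.log(lateDocState, avgAfterFirst, avgAfterSecond);
```

The late document is skipped by the first pass and counted by the second, which is the behaviour the three states are meant to guarantee. The sketch only covers inserts; a document modified after it has been processed would need its state reset to UNPROCESSED by whatever code modifies it.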
