Reputation: 107
The scenario is this: I have 2 million documents in MongoDB and I want to process them in batches of, say, 100 or 1,000 (because V8 memory is scarce). After reading a batch of documents I want to do some computation and write the result to a file, which might take longer than 10 minutes, before I come back and get the next batch. How can I do that with the Node.js MongoDB driver?
I couldn't find all the methods I need in the Node.js MongoDB driver. For example, the mongo shell has cursor.objsLeftInBatch(), which tells you how many documents are left in the current batch, and I could not find an equivalent in Node.js.
Another important piece of functionality I was looking for in the Node.js MongoDB driver is how to set the cursor to never time out (this is possible in the mongo shell and in other language drivers, but I am not sure about Node.js).
var async = require('async'); // assuming the "async" module is used below

var hash_map = {};
db.collection(collection_name).find().batchSize(100).each(function(err, docs) {
    if (err) throw err;
    // I expected each() to hand me a whole batch of 100 documents here
    docs.forEach(function(doc) {
        var id = doc._id; // assume this is a string, not an ObjectID
        hash_map[id] = doc.key1;
    });
    // This async series would take say 20 minutes, or just assume it takes a
    // long time. Now, would the cursor time out before I retrieve the next batch?
    async.series([
        processData.bind(null, hash_map),
        writeDataToFile
    ], function(err) {
        if (err) throw err;
        return callback();
    });
});
Upvotes: 3
Views: 3859
Reputation: 151112
This is a wrong interpretation of what "batchSize" does. All it means (essentially as a parameter on the cursor returned by .find(), despite being exposed as a driver method) is that the server will return a "batch" of 100 results (in this case) at a time, which is then iterated as a "cursor".
You are missing the concept of a "cursor". You do not "actually" get back a "data" result that contains 100 records or "collection items" in the overall result. You just have a "pointer" that allows you to "fetch" a single "record/document" at a time via the .next() method.
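To make that distinction concrete, here is a minimal sketch of driving the cursor manually with .next(), assuming the callback-style API of the 2.x Node driver; processDoc is a hypothetical handler for the slow per-document work:

var cursor = db.collection(collection_name).find().batchSize(100);

function iterate() {
    cursor.next(function(err, doc) {
        if (err) throw err;
        if (doc == null) return; // cursor exhausted, nothing left to fetch
        // processDoc is hypothetical: do the slow work, then pull the next doc
        processDoc(doc, function() {
            iterate();
        });
    });
}

iterate();

The batchSize(100) only controls how many documents the server ships per round trip; each .next() call still hands you exactly one document, transparently fetching a fresh batch from the server when the current one is exhausted.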
Convenience methods like .each() and .toArray() are meant for "small" result sets where the results are basically "transformed" into an array for further processing, either manually via .toArray() or implicitly via methods like .each().
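For contrast, a small, bounded query is where .toArray() is reasonable; this sketch (with an illustrative limit of 50) materializes the whole result in memory at once:

db.collection(collection_name).find().limit(50).toArray(function(err, docs) {
    if (err) throw err;
    docs.forEach(function(doc) {
        // the entire result set is now a plain in-memory array
    });
});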
For large result sets you want the "stream" API provided by Node and the MongoDB driver. See here in the documentation for how to invoke that on current versions.
Newer releases of the MongoDB Node driver will return a Node stream interface by default.
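As a rough sketch of that streaming approach, assuming a 2.x driver where the cursor itself is a Node Readable stream, you can pause() the stream while the slow asynchronous work runs and resume() when it completes; writeDoc here is a hypothetical stand-in for the expensive processing:

var stream = db.collection(collection_name).find().batchSize(100);

stream.on('data', function(doc) {
    stream.pause(); // stop emitting documents while we work
    writeDoc(doc, function(err) { // slow asynchronous processing of one document
        if (err) throw err;
        stream.resume(); // ask the stream for the next document
    });
});

stream.on('end', function() {
    console.log('all documents processed');
});

Because the stream is paused during processing, this pattern never holds more than one document (plus the current server batch) in memory, no matter how long each writeDoc call takes.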
The point being that you "could" use a cursor modifier such as .limit() here and "loop" the results in "pages", but in your context this is not the most efficient way. Look into the streaming API as referenced by the links.
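For completeness, the "pages" approach would look roughly like the following sketch, using skip()/limit() with a stable sort; processBatch is a hypothetical batch handler. Note that skip() forces the server to walk past every previously-read document, which is why this degrades on a 2-million-document collection:

var pageSize = 100;

function fetchPage(page, done) {
    db.collection(collection_name).find()
        .sort({ _id: 1 })    // stable order so pages do not overlap
        .skip(page * pageSize)
        .limit(pageSize)
        .toArray(function(err, docs) {
            if (err) return done(err);
            if (docs.length === 0) return done(); // no more pages
            processBatch(docs, function() {       // hypothetical batch handler
                fetchPage(page + 1, done);        // move on to the next page
            });
        });
}

fetchPage(0, function(err) {
    if (err) throw err;
});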
Upvotes: 3