Reputation: 183
MongoDB introduced the Bulk() API in version 2.6. I checked the APIs, and it seems great to me.
Before this API, if I needed to do a bulk insert, I had to store documents in a List, then use insert() to insert the whole List. In a multi-threaded environment, concurrency also had to be considered.
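For example, this is the kind of list-based approach I mean ( the collection and field names are just for illustration ):

    // Pre-2.6 pattern: accumulate documents client side, then insert the whole List
    var docs = [];
    for (var i = 0; i < 1000; i++) {
        docs.push({ _id: i, value: "item " + i });
    }
    db.items.insert(docs);   // a single insert() call with the accumulated List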
With the new API, I can simply queue operations with Bulk.insert() or Bulk.find().update() and let the driver handle the batching, is that right?
Also, a bulk object is created with db.collection.initializeUnorderedBulkOp(), so if a bulk instance is not released, it will stay connected to the MongoDB server, is that right?
Upvotes: 1
Views: 588
Reputation: 151112
To the basic question of "do you need to store your own list?": not really, but I suppose it all depends on what you are doing.
For a basic idea of the internals of what is happening under the Bulk Operations API the best place to look is at the individual command forms for each type of operation. So the relevant manual section is here.
So you can think of the "Bulk" interface as a list or collection of all of the operations that you add to it. You can add to it pretty much as much as you wish ( within certain memory and practical constraints ), and consider that the "drain" method for this "queue" is the .execute() method.
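As a minimal sketch of that queue/drain pattern in the shell ( the collection name is just for illustration ):

    // Queue up operations on the client side
    var bulk = db.items.initializeOrderedBulkOp();
    bulk.insert({ _id: 1, status: "new" });
    bulk.insert({ _id: 2, status: "new" });
    bulk.find({ status: "new" }).update({ $set: { checked: true } });

    // Nothing has been sent to the server yet; .execute() "drains" the queue
    var result = bulk.execute();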
As noted in the documentation there, regardless of how many operations you "queue", only groups of at most 1000 operations at a time are actually sent to the server. The other thing to keep in mind is that there is no governance making sure those 1000-operation requests actually fit under the 16MB BSON limit. That is still a hard limit with MongoDB, so each "request" you form must total less than that data limit in size when sent to the server.
So generally speaking, it is often more practical to make your own "execute/drain" requests to the server once per every 1000 entries or fewer. Mileage may vary on this, but there are some considerations to make here.
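For instance, a sketch of that manual drain pattern, assuming a source array named docs:

    // Drain the queue every 1000 operations and start a fresh bulk object
    var bulk = db.items.initializeUnorderedBulkOp();
    var counter = 0;

    docs.forEach(function(doc) {
        bulk.insert(doc);
        counter++;
        if (counter % 1000 === 0) {
            bulk.execute();
            bulk = db.items.initializeUnorderedBulkOp();
        }
    });

    // Flush any remainder that did not land on a 1000 boundary
    if (counter % 1000 !== 0) {
        bulk.execute();
    }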
With respect to either "Ordered" or "UnOrdered" operations requests: in the former case, all queued operations will be aborted in the event of an error being generated in the batch sent, meaning of course all operations occurring after the error is encountered.
In the latter case, for "UnOrdered" operations, there is no fatal error reported; instead, the WriteResult that is returned contains a "list" of any errors that are encountered. "UnOrdered" also means the operations are not necessarily "applied" in any particular order, so you cannot "queue" operations that rely on something else in the "queue" being processed before that operation is applied.
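To illustrate the difference, here is a sketch that forces a duplicate key error ( the collection name is just for illustration ):

    // Unordered: the duplicate key is reported, but the other inserts still apply
    var bulk = db.items.initializeUnorderedBulkOp();
    bulk.insert({ _id: 1 });
    bulk.insert({ _id: 1 });   // duplicate: produces a write error
    bulk.insert({ _id: 2 });   // still applied in the unordered case

    try {
        printjson(bulk.execute());
    } catch (e) {
        // The shell raises a BulkWriteError holding the collected errors
        printjson(e);
    }

With initializeOrderedBulkOp() instead, the insert of { _id: 2 } would never be attempted, since it is queued after the failing operation.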
So there is the concern of how large a WriteResult you are going to get, and indeed how you handle that response in your application. As stated earlier, mileage may vary here between one very large response and smaller, more manageable responses.
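For inspecting a successful response, the shell's BulkWriteResult exposes counters such as:

    var result = bulk.execute();
    print(result.nInserted);   // documents inserted
    print(result.nMatched);    // documents matched by update selectors
    print(result.nModified);   // documents actually modified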
As far as concurrency is concerned, there is really one thing to consider here. Even though you are sending many instructions to the server in a single call, and not waiting for individual transfers and acknowledgements, the server is still only really processing one instruction at a time. The operations are either ordered, as implied by the initialize method, or "un-ordered" where that is chosen, in which case operations can run in "parallel" as it were on the server until the batch is drained.
But there is no "lock" until the "batch" completes, so it is not a substitute for a "transaction"; do not make that mistake as a design point. The same MongoDB rules apply, but the benefit here is "one write to the server" and "one response back", rather than one for each operation.
Finally, as to whether there is some "server connection" held here by the API, the answer is no, there is not. As pointed out by the initial suggestion to look at the command internals, this "queue" building is purely "client side only". There is no communication with the server in any way until the .execute() method is called. This is "by design", and actually half the point, as mainly we don't want to be sending data to the server each time you add an operation. It is done all at once.
So "Bulk Operations" are a "client side queue". Everything is stored within the client side until the .execute()
"drains" the queue and sends the operations to the server all at once. A response is then given from the server containing all of the results from the operations sent that you can handle however you wish.
Also, once .execute() is called, no more operations can be "queued" to the bulk object, and neither can .execute() be called again. Depending on the implementation, you can do some further examination of the "Bulk" object and results. But in the general case, where you need to send more "bulk" operations, you re-initialize and start again, just as you would with most queue systems.
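A quick sketch of that re-initialization:

    var bulk = db.items.initializeOrderedBulkOp();
    bulk.insert({ _id: 10 });
    bulk.execute();

    // bulk.execute();           // would fail: this batch has already been executed
    bulk = db.items.initializeOrderedBulkOp();   // start a fresh queue instead
    bulk.insert({ _id: 11 });
    bulk.execute();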
Summing up:

- "Bulk" operations are a client-side queue; nothing is sent to the server until .execute() is called.
- Operations are sent in groups of at most 1000 at a time, and the 16MB BSON limit still applies to each request.
- "Ordered" operations abort on the first error; "UnOrdered" operations report errors in the result and are not applied in any guaranteed order.
- A batch is not a "transaction"; the same MongoDB rules apply.
- Once .execute() is called, re-initialize to queue more operations.

So it is a really good tool. You get much better write operations than you do from the legacy command implementations. But do not expect this to offer functionality outside of what MongoDB basically does.
Upvotes: 3