Reputation: 607
Using the Java driver for MongoDB, I inserted 25,637,015 documents into a MongoDB cluster. The documents were retrieved from a SQL Server database and inserted into an initially empty sharded collection (called col) by 8 concurrent threads. The process took 2 hours. What is interesting and puzzling is that something kept going on for over 6(!) hours AFTER the program had finished.
First, the hard drives in my cluster nodes continued to spin like crazy. Second, and more importantly, db.col.count(), run at intervals of less than a second, kept returning different results:
mongos> db.col.count()
25694898
mongos> db.col.count()
25694917
mongos> db.col.count()
25695154
mongos> db.col.count()
25695207
mongos> db.col.count()
25695422
mongos> db.col.count()
25695493
mongos> db.col.count()
25696024
mongos> db.col.count()
25696130
mongos> db.col.count()
25698565
mongos> db.col.count()
25695145
Even more intriguing, all these counts, while going up and down, were greater than the number of inserted documents: 25,637,015. Had they been smaller, I could speculate that the documents went to some sort of queue and were being slowly processed, but greater?!
As I said, after six hours it all stabilized: the hard drives stopped spinning and db.col.count() finally returned the correct number: 25637015.
In case it matters: I have 2 replica sets in my sharded cluster. Each replica set has 2 data nodes and 1 arbiter-only node. I run 3 config servers and 3 mongos instances, all spread across 4 CentOS boxes (virtual) running on Windows hosts. The source SQL Server is on yet another physical machine. The balancer was not disabled during the insert or at any time afterwards. My MongoDB version is 2.2.6, 64-bit.
Any idea what MongoDB was doing for six hours after the Java program finished inserting? Why was the count so high?
Thank you
Upvotes: 2
Views: 393
Reputation: 4183
With most drivers, MongoDB uses memory to enhance write performance. Your insert first goes to memory and the journal, and then the call returns at once. At that moment your data is not on disk yet. For more information, have a look at the Write Concern section of the MongoDB Manual. That's why your collection keeps growing.
As for count returning more than the actual number, there is a JIRA issue about it. See if it answers your question. Unfortunately, it's not fixed yet.
EDIT:
As for the time spent, it's hard to say for sure; it depends on your hardware, especially your disks. It would be helpful to run mongostat and mongotop to see what's going on. Once you know whether insertion is still running, you'll know whether the count result makes sense. I also found another related JIRA issue explaining how count works in sharded clusters, which may apply to your situation. However, it happens only while the cluster is migrating chunks. Before going any further, please let me know how your sharded cluster is built. What's your shard key?
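To make the migration effect a bit more concrete, here is a toy Java sketch. It is not MongoDB code: the Shard class and the method names are made up for illustration. It assumes that a count simply adds up what each shard physically stores, and that a chunk migration copies documents to the destination shard before deleting them from the source. Under those assumptions the total can temporarily overshoot while the balancer is moving chunks.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: not the MongoDB API, all names here are hypothetical.
class MigrationCountSketch {

    // Stand-in for a shard: the set of document ids it physically stores.
    static final class Shard {
        final Set<Integer> docs = new HashSet<>();
    }

    // Assumed behaviour: count sums per-shard totals without checking chunk ownership.
    static long naiveCount(Shard... shards) {
        long total = 0;
        for (Shard s : shards) {
            total += s.docs.size();
        }
        return total;
    }

    // Migration step 1: copy ids in [from, to) to the destination; the source still has them.
    static void copyChunk(Shard src, Shard dst, int from, int to) {
        for (int id = from; id < to; id++) {
            if (src.docs.contains(id)) {
                dst.docs.add(id);
            }
        }
    }

    // Migration step 2: remove the migrated ids from the source.
    static void deleteChunk(Shard src, int from, int to) {
        for (int id = from; id < to; id++) {
            src.docs.remove(id);
        }
    }

    public static void main(String[] args) {
        Shard a = new Shard();
        Shard b = new Shard();
        for (int id = 0; id < 1000; id++) {
            a.docs.add(id);
        }
        System.out.println("before migration: " + naiveCount(a, b));
        copyChunk(a, b, 0, 200);   // documents now exist on both shards
        System.out.println("mid-migration:    " + naiveCount(a, b));
        deleteChunk(a, 0, 200);    // source cleanup finishes
        System.out.println("after cleanup:    " + naiveCount(a, b));
    }
}
```

Running it prints 1000, then 1200, then 1000 again. If your cluster behaves roughly like this, a count that goes up and down above the real number and settles only once the balancer is idle would be consistent with what you saw.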
Upvotes: 1