Reputation: 684
I have been struggling to deploy a large database. I have deployed 3 shard clusters and started indexing my data. However, it's been 16 days and I'm only halfway through.
The question is: should I import all the data into a non-sharded cluster, activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Will this auto-balance my data?
Or should I wait another 16 days with the method I am currently using?
Edit: Here is more explanation of the setup and the data being imported.
We have 160 million documents that look like this:
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup: 3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM (Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program into the Mongos. What would be the ideal way to import, index, and shard this data without waiting a month for the process to complete?
Upvotes: 2
Views: 2559
Reputation: 13528
If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
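The load order described above can be sketched in Java, since the importer in the question is a Java program. This is only an illustration under stated assumptions: `TripleStore` is a hypothetical stand-in for whatever collection handle the real driver provides (it is not a MongoDB API), the batch size is arbitrary, and the index list is copied from the question.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkLoadSketch {

    // Hypothetical stand-in for the driver's collection handle; NOT a MongoDB API.
    interface TripleStore {
        void insertBatch(List<String[]> batch);
        void createIndex(String... fields);
    }

    // Insert every document in fixed-size batches first, then build the indexes once.
    static void load(Iterable<String[]> triples, TripleStore store, int batchSize) {
        List<String[]> batch = new ArrayList<>(batchSize);
        for (String[] triple : triples) {
            batch.add(triple);
            if (batch.size() == batchSize) {
                store.insertBatch(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            store.insertBatch(batch); // flush the final partial batch
        }
        // Indexes listed in the question, created only after all data is in.
        store.createIndex("subject");
        store.createIndex("predicate");
        store.createIndex("object");
        store.createIndex("subject", "predicate");
        store.createIndex("object", "predicate");
    }

    // In-memory fake that records the order of calls, used here instead of a real database.
    static final class RecordingStore implements TripleStore {
        final List<String> calls = new ArrayList<>();
        int inserted = 0;

        @Override
        public void insertBatch(List<String[]> batch) {
            inserted += batch.size();
            calls.add("insert:" + batch.size());
        }

        @Override
        public void createIndex(String... fields) {
            calls.add("index:" + String.join(",", fields));
        }
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            triples.add(new String[] {"s" + i, "p", "o" + i});
        }
        RecordingStore store = new RecordingStore();
        load(triples, store, 2);
        System.out.println(store.calls); // all inserts appear before any index call
    }
}
```

In a real importer the two interface methods would map onto the driver's bulk insert and index creation calls. The point of the sketch is only the ordering: no index exists while the inserts run.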
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes in your db.stats() output and hook up your DBs to the Mongo Monitoring Service (MMS).
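As a rough illustration of that check, the sketch below compares an index size against a node's RAM. The 8 GiB figure comes from the question; the 12 GiB index size is a made-up placeholder, not a measurement, and should be replaced with the `indexSize` value (in bytes) that db.stats() reports on each shard. Documents compete for the same memory, so a ratio below 1 is not a guarantee that indexing stays in RAM.

```java
final class IndexMemoryCheck {

    static final long GIB = 1024L * 1024L * 1024L;

    // Fraction of physical RAM the indexes alone would occupy.
    static double indexToRamRatio(long indexSizeBytes, long ramBytes) {
        if (ramBytes <= 0) {
            throw new IllegalArgumentException("ramBytes must be positive");
        }
        return (double) indexSizeBytes / ramBytes;
    }

    public static void main(String[] args) {
        long ramBytes = 8 * GIB;        // per-node RAM stated in the question
        long indexSizeBytes = 12 * GIB; // HYPOTHETICAL placeholder; read indexSize from db.stats()
        double ratio = indexToRamRatio(indexSizeBytes, ramBytes);
        System.out.println("index/RAM ratio: " + ratio);
        if (ratio >= 1.0) {
            System.out.println("Indexes alone exceed RAM; expect page faults during indexing.");
        }
    }
}
```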
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
1. It is running out of physical memory or getting itself into a poor I/O pattern. MMS can help diagnose both. Check out the page faults graph in particular.
2. It is operating on unindexed collections, which does not apply in your case.
Upvotes: 1