Leabdalla

Reputation: 656

Tips for dealing with millions of documents?

I'm logging a lot of information from 8 machines in a sharded MongoDB cluster. It grows by about 500k documents per day across 3 collections, which is roughly 1 GB/day.

my structure is:

For now, no collection has sharding enabled and none of the shards is a replica set. I have only just installed the cluster.

Now I need to run queries across all of these documents and collections to get different statistics. This means many where clauses, counts, etc. The first test I made was looping over all documents in one collection with PHP and printing the ID. This crashed the primary shard server. Then I tried some other tests, limiting each query to 5k documents, and that works.

My question is whether there is a better way to deal with this structure.

Upvotes: 2

Views: 106

Answers (1)

Adam Harrison

Reputation: 3421

The solution is probably going to depend on what you're hoping to accomplish long term and what types of operations you're trying to perform.

A replica set will only help you with redundancy and data availability. If you are planning on letting the data continue to grow long term, you may want to consider this as a disaster recovery solution.

Sharding, on the other hand, will provide you with horizontal scaling and should increase the speed of your queries. Since a query crashed your primary shard server, I'm guessing that the data it was attempting to process was too large for it to handle by itself. In that case, it sounds like sharding the collection being queried would help, as it would spread the workload across multiple servers. You should also consider whether indexes would make the queries more efficient.
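As a rough illustration of what sharding one collection involves, here is a minimal sketch in Python. It assumes `client` is a `pymongo.MongoClient` connected to a mongos router, and the database, collection, and field names (`logs`, `events`, `machine_id`, `ts`) are made up for the example, since the question doesn't show the real schema. The PHP driver should offer equivalent operations.

```python
def shard_collection(client, db_name, coll_name, shard_key):
    """Index and shard one collection; run this against a mongos router.

    shard_key is a list of (field, direction) pairs, for example
    [("machine_id", 1), ("ts", 1)] (hypothetical field names).
    """
    namespace = f"{db_name}.{coll_name}"
    # A collection that already holds data needs an index that starts
    # with the shard key before it can be sharded.
    client[db_name][coll_name].create_index(shard_key)
    # Allow collections in this database to be sharded.
    client.admin.command("enableSharding", db_name)
    # Distribute the collection's documents across the shards by this key.
    client.admin.command("shardCollection", namespace, key=dict(shard_key))
    return namespace
```

Something like `shard_collection(client, "logs", "events", [("machine_id", 1), ("ts", 1)])` would be the call. The choice of shard key matters a lot: a key that most of your queries filter on lets mongos route each query to a few shards instead of all of them.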

However, keep in mind that sharding with your current setup would introduce more possible points of failure: without replica sets, if any one of the disks gets corrupted, the data on that shard is lost and your data set as a whole is effectively trashed.

In the end, it may come down to who is doing the heavy lifting, PHP or Mongo?

If you're just doing counts and returning large sets of documents for PHP to process, you might be able to handle performance issues by creating the proper indexes for your queries.
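To make the "who does the heavy lifting" point concrete, here is a small sketch of letting MongoDB do the counting instead of looping over every document in the client. It is written in Python and assumes `collection` is a `pymongo` collection; the field names `ts` and `machine_id` are hypothetical, since the question doesn't show the document structure.

```python
from datetime import datetime, timedelta


def build_daily_pipeline(day_start):
    """Pipeline that counts one day's documents per machine on the server."""
    day_end = day_start + timedelta(days=1)
    return [
        # Filter first so an index on ts (hypothetical field) can be used.
        {"$match": {"ts": {"$gte": day_start, "$lt": day_end}}},
        # Group on the server; only one small document per machine comes back.
        {"$group": {"_id": "$machine_id", "count": {"$sum": 1}}},
        {"$sort": {"_id": 1}},
    ]


def daily_counts(collection, day_start):
    """Return {machine_id: count} for the given day."""
    # An index on the filtered field lets $match avoid a full collection scan.
    collection.create_index([("ts", 1)])
    cursor = collection.aggregate(build_daily_pipeline(day_start), allowDiskUse=True)
    return {doc["_id"]: doc["count"] for doc in cursor}
```

Calling `daily_counts(collection, datetime(2013, 1, 1))` sends only the per-machine totals over the wire, rather than hundreds of thousands of raw documents. Building the index on a large existing collection takes a while the first time, so in practice you would probably create it once up front rather than inside the query function.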

Upvotes: 1
