Reputation: 3845
Hey, I have a MongoDB setup with 3 shards, each a 3-member replica set, running on 3 physical servers. Sharding is range-based on a category ID so that data is spread evenly across the shards.
The volume of data I load into the database each week is huge, and most of my queries only touch data from the current day or the 2 previous days.
So I was trying to add a shard with no replicas to the current setup, so that the new shard holds the old data (older than 5 days) and the 3 existing shards hold only the last 5 days of data.
If this is possible, most queries will hit the 3 smaller shards and only rare queries will hit the new shard for old data, which should improve TPS.
Is there any way to configure this in MongoDB with sharding or replication?
Thanks in advance.
Upvotes: 0
Views: 248
Reputation: 42342
While it might be tempting to use tag-aware sharding for this, it's actually not simple, nor is it very efficient. Here is why:
1) The range of keys that should live on the "old" shard changes every day. If your cut-off is five days ago, then at midnight you will need to update the tag ranges to reflect that it's a new day.
2) As soon as you add the day that was five days ago to the range that belongs on the "old" shard, the balancer will need to migrate that data to the old shard. The problem is that this shard will hold loads of old data and therefore probably really huge indexes, so writes to it will be much slower. On top of that, reading and removing the day-5 data from your "active" shard(s) may interfere with the queries on "current" data.
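To make point 1 concrete, here is a rough sketch of the boundary you would have to recompute every midnight. It assumes a date-based shard key (the setup in the question shards on category ID, so this is hypothetical), and the namespace `mydb.events` and the tag names `old` and `active` are made up for illustration. `None` stands in for an open-ended bound (MinKey/MaxKey in a real deployment).

```python
from datetime import datetime, timedelta, timezone


def tag_ranges(now, days_active=5, ns="mydb.events"):
    """Return the two tag ranges that would be valid for the day containing `now`.

    Everything before the cutoff belongs to the hypothetical "old" tag,
    everything from the cutoff onward to the hypothetical "active" tag.
    """
    # Truncate to midnight so the boundary only moves once per day.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    cutoff = midnight - timedelta(days=days_active)
    return [
        {"ns": ns, "tag": "old", "min": None, "max": cutoff},
        {"ns": ns, "tag": "active", "min": cutoff, "max": None},
    ]


if __name__ == "__main__":
    today = datetime(2013, 5, 10, 14, 30, tzinfo=timezone.utc)
    for r in tag_ranges(today):
        print(r["tag"], r["min"], r["max"])
```

Every time the cutoff moves forward by a day, a full day of documents changes from "active" to "old", and that is the data the balancer then has to migrate.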
So, maybe it's not such a great option - although it is a valid option to consider.
I would suggest considering something else: insert the data into this cluster and also into a separate "archival" replica set, and then use a TTL (time to live) index on this cluster to "expire" data once it is older than, say, a week. Just something to consider if you don't actually need to query older data very often.
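As a minimal sketch of the TTL part, the function below builds the `createIndexes` command document for a TTL index. The collection name `events` and the field name `created_at` are assumptions for illustration; the field must hold a BSON date for expiry to work, and a TTL index has to be a single-field index.

```python
def ttl_index_command(collection, field, days):
    """Build a createIndexes command document for a TTL index.

    Documents become eligible for removal once the date stored in `field`
    is more than `days` days in the past.
    """
    return {
        "createIndexes": collection,
        "indexes": [
            {
                "key": {field: 1},
                "name": f"{field}_ttl",  # hypothetical index name
                "expireAfterSeconds": days * 24 * 60 * 60,
            }
        ],
    }


if __name__ == "__main__":
    # With a driver such as pymongo this document could be passed to
    # db.command(...); here it is only printed.
    print(ttl_index_command("events", "created_at", 7))
```

Expiry is handled by a background task that runs periodically, so documents disappear some time after they pass the threshold rather than at that exact moment.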
Another option is to leave things the way they are. If your data is well balanced, you are already handling more TPS than you would if you were also querying the "old" data. Remember, only data that is actually being used is loaded into physical RAM; if you aren't reading some old data, it will just quietly sit there on disk. Just make sure that all your queries are using indexes efficiently, because a collection scan can negate what I described in an instant!
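One way to check for collection scans is to look at the output of `explain()` for your common queries. The helper below is only a sketch: it assumes the explain format used by newer MongoDB versions, where each plan node carries a `stage` field and a full scan shows up as `COLLSCAN`. The sample plans are hand-written for illustration, not real server output.

```python
def has_collection_scan(plan):
    """Return True if any node in an explain plan has stage COLLSCAN.

    Pass in the winning plan (queryPlanner.winningPlan). The walk is
    recursive, so nested input stages and per-shard plans are covered.
    """
    if isinstance(plan, dict):
        if plan.get("stage") == "COLLSCAN":
            return True
        return any(has_collection_scan(v) for v in plan.values())
    if isinstance(plan, list):
        return any(has_collection_scan(v) for v in plan)
    return False


if __name__ == "__main__":
    indexed = {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}
    sharded_scan = {
        "stage": "SHARD_MERGE",
        "shards": [{"shardName": "s1", "winningPlan": {"stage": "COLLSCAN"}}],
    }
    print(has_collection_scan(indexed))
    print(has_collection_scan(sharded_scan))
```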
Upvotes: 1