Reputation: 1690
My use case will perhaps seem strange at first inspection, but I believe that, in principle, what I'm doing is a good way of scaling up massively in a short space of time without any impact on the live service.
Our live database runs on a 3-member multi-region replica set on Amazon EC2. We back up regularly by snapshotting the EBS journal and data volumes, so it is very easy to spin up standalone clones of our database from the most recent snapshot.

We periodically run some heavy/complex aggregation jobs that require us to do things programmatically that are not possible in the aggregation pipeline, and these need to pull large amounts of data from the database. We have found that pulling the data from active replica set members hampers performance, so we have been spinning up boxes with standalone mongo servers containing data from the latest snapshot. This works really nicely, though it seems to take around 30 minutes before the mongo servers become performant, which I guess is due to all the indices etc. being loaded into memory.
The thing is, I only actually want to access one or two collections from the database. I'm wondering if there is a way to prioritize the collections I wish to use, or else to drop the collections I don't want without first loading them into memory?
Upvotes: 1
Views: 136
Reputation: 27497
Part of what you are experiencing is a performance hit for new EBS volumes that have been created from a snapshot. From the EC2 documentation:
When you create a new EBS volume or restore a volume from a snapshot, the back-end storage blocks are allocated to you immediately. However, the first time you access a block of storage, it must be either wiped clean (for new volumes) or instantiated from its snapshot (for restored volumes) before you can access the block. This preliminary action takes time and can cause a 5 to 50 percent loss of IOPS for your volume the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.
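The usual remedy for this first-touch penalty is to read every block of the restored volume once before putting it into service. A minimal sketch, assuming the restored volume shows up as `/dev/xvdf` (the device name will vary):

```shell
# Initialize a volume restored from a snapshot by reading every block once,
# so that later I/O runs at full speed. /dev/xvdf is a placeholder device name.
if [ -b /dev/xvdf ]; then
  sudo dd if=/dev/xvdf of=/dev/null bs=1M
fi
```

On a large volume this takes a while, so it is typically scripted into the instance's provisioning step, before mongod is started.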
I can think of 3 ways to address this issue:
You could hit specific database files, as suggested in this MongoDB blog post: http://blog.mongodb.org/post/10407828262/cache-reheating-not-to-be-ignored
On a server restart, copy datafiles to /dev/null to force reheating to be sequential and thus much faster. This can be done even if the mongod process is already running. If the database is larger than RAM, copy only the newest datafiles (ones with the highest numbers); while this isn’t perfect, the latest files likely contain the largest percentage of frequently used data.
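The blog's suggestion can be scripted. A sketch, assuming an MMAPv1-era dbpath of `/data/db` (adjust to yours) and using modification time as a stand-in for "highest-numbered" datafiles, since the newest files normally have the highest numbers:

```shell
# Reheat the cache by sequentially reading the newest datafiles to /dev/null.
# DBPATH and the file count of 5 are assumptions; tune them to your deployment.
DBPATH=/data/db
ls -t "$DBPATH"/*.[0-9]* 2>/dev/null | head -n 5 | while read -r f; do
  dd if="$f" of=/dev/null bs=1M 2>/dev/null
done
```

As the post notes, this can be run even while the mongod process is already serving traffic.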
Upvotes: 1
Reputation: 2917
If I understand you correctly, you use a dedicated Amazon EC2 instance to pull some data out and do the aggregation on the client side.
When you start the EC2 instance, its memory is cold, i.e. no data or indexes are loaded yet. Once you begin to send queries to the box, only the data those queries access (and the relevant indexes) gets loaded into memory. So if you only use this box to retrieve specific data, only the data you need will be loaded. There is no need to prioritize, because only the required data is loaded.
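That said, if you do want to warm specific collections up front rather than waiting for the first queries to touch them, MongoDB (in the MMAPv1 era; the command was later removed along with that storage engine) had a `touch` command for exactly this. A sketch with hypothetical database and collection names:

```shell
# Preload a single collection's data and indexes into memory up front.
# "mydb" and "events" are hypothetical names; substitute your own.
if command -v mongo >/dev/null 2>&1; then
  mongo mydb --eval 'db.runCommand({ touch: "events", data: true, index: true })'
fi
```

Run once per collection you care about, right after the standalone mongod comes up.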
You mentioned that it affects performance. Could you explain what you mean? How much data are you trying to retrieve on the client?
Upvotes: 1