John Greenall

Reputation: 1690

Is there any way to prioritize the order in which mongod loads things into memory after startup?

My use case may seem strange at first glance, but I believe that in principle what I'm doing is a good way of scaling up massively in a short space of time without any impact on the live service.

Our live database runs on a three-member, multi-region replica set on Amazon EC2. We back up regularly by snapshotting the EBS journal and data volumes, so it is very easy to spin up standalone clones of our database from the most recent snapshot. We periodically run heavy, complex aggregation jobs that need programmatic steps not possible in the aggregation pipeline and that pull large amounts of data from the database. We have found that pulling the data from active replica set members hampers performance, so we have been spinning up boxes with standalone mongo servers that contain data from the latest snapshot. This works really nicely, though it seems to take around 30 minutes before the mongo servers become performant, which I guess is due to all the indices etc. being loaded into memory.

The thing is that I only actually want to access one or two collections from the database. Is there a way to prioritize the collections I wish to use, or else to drop the collections I don't want without loading them into memory?

Upvotes: 1

Views: 136

Answers (2)

John Petrone

Reputation: 27497

Part of what you are experiencing is a performance hit for new EBS volumes that have been created from a snapshot. From the EC2 documentation:

When you create a new EBS volume or restore a volume from a snapshot, the back-end storage blocks are allocated to you immediately. However, the first time you access a block of storage, it must be either wiped clean (for new volumes) or instantiated from its snapshot (for restored volumes) before you can access the block. This preliminary action takes time and can cause a 5 to 50 percent loss of IOPS for your volume the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.

I can think of 3 ways to address this issue:

  1. The rest of the documentation http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html discusses ways to pre-warm the EBS volume prior to bringing it live (basically by reading all of the blocks on the volume). This may or may not be a useful way to address the problem for you.
  2. You could hit specific database files as suggested by a mongodb blog post http://blog.mongodb.org/post/10407828262/cache-reheating-not-to-be-ignored :

    On a server restart, copy datafiles to /dev/null to force reheating to be sequential and thus much faster. This can be done even if the mongod process is already running. If the database is larger than RAM, copy only the newest datafiles (ones with the highest numbers); while this isn’t perfect, the latest files likely contain the largest percentage of frequently used data.

  3. You could use your already pre-warmed EBS volume from a secondary instead of the newly created volume. In this case you would take your secondary down, swap in the newly created volume and take the old volume for use with the new instance. This would take a few minutes but would provide you with a fully warmed up EBS volume probably faster than trying to pre-warm it yourself and performance should be better.
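Options 1 and 2 both come down to reading files sequentially and throwing the data away. Here is a minimal Python sketch of that idea, roughly equivalent to copying the datafiles to /dev/null. It assumes the datafiles are named with a numeric suffix (for example `mydb.0`, `mydb.1`); the function names and the `max_bytes` cap are my own illustration, not anything provided by MongoDB or AWS.

```python
import os
import re

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time


def datafile_number(path):
    """Return the numeric suffix of a file such as 'mydb.12', or -1 if there is none."""
    match = re.search(r"\.(\d+)$", os.path.basename(path))
    return int(match.group(1)) if match else -1


def warm_files(paths, max_bytes=None):
    """Read the given files sequentially and discard the contents.

    Files with the highest numeric suffix are read first, following the
    "newest datafiles first" advice. If max_bytes is given, reading stops
    once at least that many bytes have been read (it may overshoot by up
    to one chunk). Returns the total number of bytes read.
    """
    total = 0
    for path in sorted(paths, key=datafile_number, reverse=True):
        with open(path, "rb") as f:
            while True:
                if max_bytes is not None and total >= max_bytes:
                    return total
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

You would call it with something like `warm_files(glob.glob("/data/db/mydb.[0-9]*"))`, where the path is hypothetical and depends on your dbpath and database name. Passing a `max_bytes` value near the size of RAM covers the case where the database is larger than memory.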

Upvotes: 1

Andrei Beziazychnyi

Reputation: 2917

If I understand you correctly, you use a dedicated Amazon EC2 instance to pull some data out and do the aggregation on the client side.

Once you start the EC2 instance, memory will be in a cold state, i.e. no data or indexes are loaded in memory. Once you begin sending queries to the box, only the data accessed by those queries, plus some indexes, will be loaded into memory. So if you only use this box to fetch some data, only the data you need will be loaded. There is no need to prioritize, because only the required data is loaded.

You mentioned that it affects performance. Could you explain what you mean? How much data are you trying to retrieve on the client?

Upvotes: 1
