betseyb

Reputation: 1332

What is the impact of adding indexes on a MongoDB database containing a large volume of data?

I have a MongoDB database containing millions of tweets, and it appears that to improve query performance I will need to add one or more compound indexes. There are already individual indexes on the fields that will go into the compound index. Before I just go adding these indexes, I want to have some idea of what the impact will be, both short term (is it going to take days to build the initial index?) and long term (will it slow inserts enough to outweigh the benefit of faster queries?)

This is not a production database - no replicas, no shards. It lives on a low-end Linux box which doesn't run much, if anything, else. As I said, millions of tweets. The relevant fields are userid and tweeted_at (a date). There's little enough free space on the machine that I have nowhere to put a copy of the db, so trying the indexes out on a copy isn't really an option. I don't expect exact answers, just some guidance.
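For reference, before adding anything you can check what indexes already exist and how much space they take, which gives a rough idea of what one more compound index will cost. A sketch in the mongo shell, assuming the collection is named tweets (adjust to your actual collection name):

```javascript
// List the indexes that already exist on the collection,
// including the individual ones on userid and tweeted_at
db.tweets.getIndexes()

// totalIndexSize (in bytes) shows what the current indexes cost;
// a compound index on userid + tweeted_at will be of a similar order
db.tweets.stats().totalIndexSize
```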

Upvotes: 2

Views: 1218

Answers (2)

cunninghamp3

Reputation: 111

Your database having millions of tweets is not necessarily an issue. As Markus said, though, building the index can bring your querying to a halt, in particular if you aren't sharded.

The most important additional factor in my experience (working with tens and hundreds of millions of records) is not the number of records so much as their size. TBs of data spread over a million records is considerably slower to work with than 100 GB spread over 100 million records.

Actual mileage will vary, but I would assume that a database of tweets has small enough records that the index build should be relatively quick, and that the index shouldn't add a meaningful amount of time to document creation.

Upvotes: 0

Markus W Mahlberg

Reputation: 20703

  1. Creating an index will basically lock your database - no reads, no writes.
  2. You can have the indices created in the background via the {background: true} option. However, this is kind of a Very Bad Idea™, as it slows down index creation considerably.
  3. Queries will be faster by orders of magnitude. If that outweighs the (small) loss of write performance heavily depends on your application. If you do just occasional reading, it might be a waste of time. From my experience, it usually is worth the time except for very special applications. Assuming that you will do some sort of aggregation on the data or some sort of map/reduce operations, I'd say do it.
  4. MongoDB is quite fast and efficient. I just set up a sharded cluster (2 standalone shards, one config server and one mongos on the same machine - my heavily loaded MacBook Air, which was compiling a Spring application at the same time), and creating a compound index on 2 fields in 1M docs took roughly 16 seconds. Even if we assume that each mongod used one CPU and the processing was done in parallel, the disk would have been the limiting factor. I think it is safe to assume that we are talking minutes rather than hours, let alone days.
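A sketch of points 1 and 2 in the mongo shell, assuming a collection named tweets and a compound index on the two fields from the question (field order and sort directions should match your most common queries):

```javascript
// Foreground build (default): locks the database for the duration of
// the build - no reads, no writes - but finishes faster
db.tweets.createIndex({ userid: 1, tweeted_at: -1 })

// Background build: reads and writes keep working during the build,
// at the cost of a considerably slower build
db.tweets.createIndex({ userid: 1, tweeted_at: -1 }, { background: true })

// Afterwards, verify a typical query actually uses the new index
db.tweets.find({ userid: 12345 }).sort({ tweeted_at: -1 }).explain()
```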

Edit: To speed things up, you might want to remount the relevant partition with the noatime flag set.
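For example (a config sketch; the device and mount point are placeholders for whatever partition holds your data files), the corresponding /etc/fstab entry would gain the noatime option:

```
/dev/sdb1  /var/lib/mongodb  ext4  defaults,noatime  0  2
```

Or, without editing fstab, remount on the fly as root with mount -o remount,noatime /var/lib/mongodb.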

Upvotes: 2
