Reputation: 728
I created a collection in MongoDB consisting of 11446615 documents.
Each document has the following form:
{
"_id" : ObjectId("4e03dec7c3c365f574820835"),
"httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
"words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
"howMany" : 3
}
httpReferer: just a URL
words: the words parsed from the URL above; the list holds between 15 and 90 entries.
I am planning to use this database to obtain a list of webpages with similar content.
I'll be querying this collection by the words field, so I created (or rather, started creating) an index on that field:
db.my_coll.ensureIndex({words: 1})
I started building the index about 3 hours ago, and it doesn't look like it will finish within the next 3 hours.
How can I speed up the indexing? Or should I take a completely different approach to this problem? Any ideas are welcome :)
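For context, this is the kind of query a { words: 1 } multikey index is meant to serve (the word lists below are just illustrations):

```javascript
// Pages sharing at least one word with a given page:
db.my_coll.find({ words: { $in: ["SEX", "DRUGS", "ROCKNROLL"] } });

// Pages containing all of the given words:
db.my_coll.find({ words: { $all: ["SEX", "DRUGS"] } });
```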
Upvotes: 18
Views: 28434
Reputation: 1388
In 2024, there are two good options for building large indexes on MongoDB replica sets and sharded clusters. Option #1 is the widely recommended one; however, the index I had to build would have taken around 10 days per node with the database on a traditional HDD. That is not viable for large replica sets or sharded clusters, so Option #2 was much faster for me.
Option #1 - Rolling Index Build
Option #2 - Create the index on one node (per replica set), back up the database files, and then copy the backup to all the other nodes. Just make sure the oplog is large enough to cover the time window it takes to fully transfer the backup. The index build will also complete orders of magnitude faster if you run it on a node whose database lives on an SSD or NVMe drive.
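For reference, a rolling index build (Option #1) roughly follows these steps; check the MongoDB documentation for the exact restart procedure for your version:

```javascript
// For each secondary, one at a time:
// 1. Restart the member as a standalone (outside the replica set).
// 2. Build the index while the node is standalone:
db.my_coll.createIndex({ words: 1 });
// 3. Restart it as a replica set member and wait for it to catch up via the oplog.
// Finally, step down the primary and repeat the same steps on it:
rs.stepDown();
```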
Upvotes: 0
Reputation: 557
I also faced the same issue and resolved it with the following workaround:
// 1. Move the existing collection out of the way
db.<existing_collection>.renameCollection("<rename_collection>");
// 2. Create the index on the (now empty) original collection name
db.<existing_collection>.createIndex({<option_goes_here>});
// 3. Copy the documents back; each insert maintains the index incrementally
db.<rename_collection>.find().forEach(function(doc){db.<existing_collection>.insert(doc);});
// 4. Drop the renamed copy once the data is back
db.<rename_collection>.drop();
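If the copy-back loop is the bottleneck, one possible speedup (a sketch, not tested against this workload) is batching the inserts with insertMany (available in MongoDB 3.2+) instead of issuing one insert per document:

```javascript
// Sketch: copy documents back in batches of 1000.
var batch = [];
db.<rename_collection>.find().forEach(function (doc) {
  batch.push(doc);
  if (batch.length === 1000) {
    db.<existing_collection>.insertMany(batch);
    batch = [];
  }
});
if (batch.length > 0) db.<existing_collection>.insertMany(batch);
```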
Upvotes: 0
Reputation: 404
Background indexes also have some issues.
If you have a replica set, I prefer to do a "rolling index build".
I think this is the cleanest solution.
Upvotes: 4
Reputation: 2258
Nope, indexing is slow for large collections. You can, however, create the index in the background:
db.my_coll.ensureIndex({words: 1}, {background: true});
Creating the index in the background is slower and results in a larger index. However, the index won't be used by queries until the build is complete; in the meantime you'll be able to use the database normally, and the index build won't block other operations.
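If you want to check how far along a build is, you can watch it from a second shell with db.currentOp(); in-progress index builds appear as operations whose msg field reports the build progress:

```javascript
// Run from another shell while the index is building:
db.currentOp().inprog.forEach(function (op) {
  if (op.msg && /Index Build/.test(op.msg)) printjson(op);
});
```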
Upvotes: 24