Reputation: 18119
I'm currently faced with removing 100M+ documents from several collections ranging from 100k documents to 100M documents in a database with ~300M documents in total. Additionally, each document has references in other collections which must be nullified. I have a list of collection + ID of all the documents to be removed, and the goal is to remove them as quickly as possible, so as to have minimal impact on users.
My current approach is to send groups of 5k IDs to be deleted via `{ _id: { $in: [] } }`, and in parallel send updates to all of the collections referencing those documents in the same grouped fashion. This turned out to be very slow, so I'm looking for alternatives.
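Roughly, each group currently looks like the following mongosh sketch (the collection names `items` and `refsA` and the reference field `itemId` are placeholders, not our real schema):

```js
// One group of 5k _ids taken from the list of documents to remove.
const batch = idsToDelete.slice(0, 5000);

// Delete the documents themselves.
db.items.deleteMany({ _id: { $in: batch } });

// Nullify references to them in one referencing collection
// (repeated for every collection that references these documents).
db.refsA.updateMany(
  { itemId: { $in: batch } },
  { $set: { itemId: null } }
);
```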
I just read about the Bulk Write API and I'm wondering if that might be a better solution. If so, I'm curious what the most efficient way to make use of it is. Should I keep grouping as I am now, but send several groups at once in one Bulk request? Should I stop grouping in the query and instead use a Bulk request as my group with 5k individual delete/update commands?
Upvotes: 3
Views: 3124
Reputation: 222771
Based on your requirements (you have ~300M documents and need to remove about a third of them) I would investigate the following:
Create new collections and populate them with only the documents that should be kept. Then drop the previous collections and rename the temporary ones into their place.
A little bit more explanation: you have a collection A with n documents and need to remove k ≈ n/3 of them by ID. Build a hash set of the IDs scheduled for deletion, then iterate over the collection and, for each document whose ID is not in the set, save it into the new temporary collection. You can improve this by saving the documents with bulk inserts.
Why I think this will be faster: with indexed deletes, each lookup costs O(log n), so removing k documents that way takes roughly O(k * log n). Searching a hash set is O(1), so with the copy approach I hope for O(n) overall.
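A rough mongosh sketch of that copy-and-swap idea, assuming the list of IDs to delete fits in memory and that writes to the collection can be paused during the swap (collection names are illustrative):

```js
// Build an in-memory set of the _ids scheduled for deletion.
const toDelete = new Set(idsToDelete.map(id => id.toString()));

// Copy every surviving document into a temporary collection, in bulk batches.
let buffer = [];
db.items.find().forEach(doc => {
  if (!toDelete.has(doc._id.toString())) {
    buffer.push(doc);
    if (buffer.length === 1000) {
      db.items_tmp.insertMany(buffer);
      buffer = [];
    }
  }
});
if (buffer.length > 0) db.items_tmp.insertMany(buffer);

// Swap: drop the original collection and rename the copy into its place.
db.items.drop();
db.items_tmp.renameCollection("items");
```

Note that any secondary indexes would have to be recreated on the temporary collection before the swap.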
Upvotes: 3
Reputation: 18119
Because we can't afford user downtime and the solution needs to run on a daily basis (albeit at a much smaller scale, since we're catching up with this first run), I couldn't use Salvador Dali's solution. I ended up grouping my records-to-be-deleted into groups of 1k and sending a BulkWrite command containing one `delete()` operation for each record. In parallel I sent n BulkWrite commands to nullify references to each record, where n is the number of collections that reference the records and each BulkWrite request has 1k individual `update()` operations, similar to the deletes. This performed reasonably fast, so I didn't attempt to optimize further by adjusting the number of operations per BulkWrite command.
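For reference, one such batch looks roughly like this in mongosh (collection and field names are placeholders, and the unordered option is my choice rather than a requirement):

```js
// One group of 1k _ids; one deleteOne per record, unordered so the server
// doesn't stop the whole batch at the first error.
const batch = idsToDelete.slice(0, 1000);

db.items.bulkWrite(
  batch.map(id => ({ deleteOne: { filter: { _id: id } } })),
  { ordered: false }
);

// One BulkWrite per referencing collection, with one update per record.
// (updateMany inside the bulk would cover several referencing documents per id.)
db.refsA.bulkWrite(
  batch.map(id => ({
    updateOne: {
      filter: { itemId: id },
      update: { $set: { itemId: null } }
    }
  })),
  { ordered: false }
);
```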
Upvotes: 6