Reputation: 28179
I have a huge collection with ~1 billion documents, very few of those documents (less then 200) contain some field "rare_field
".
How can I, as quickly as possible find all documents containing that field?
If I simply do:
collection.find({ "rare_field" : { $exists : true }})
it times out. This might take days to complete, so I'm not sure that even preventing timeout via a query flag will help here, but maybe I'm wrong.
I can also write a script to go over all documents but this will be slow as it'll need to pass all 1 billions documents over the wire to my server, I want some solution that will not require any passing of data on the wire, and will be quick.
Notes: this is a sharded
collection.
I'll post my current solution as an answer, but I'm not sure it's 100% correct, and it's not as fast as I want it.
Upvotes: 0
Views: 274
Reputation: 488
Without an index on rare-field
, in the worst case, mongodb will need to go over all documents in the collection. In such a case the cursor may time out, so you need to add a flag to the cursor to prevent it from doing so.
In the mongo shell, this would mean a query like:
var cursor = db.collection.find({ "rare_field" : { $exists : true }}).noCursorTimeout();
If your concern is that network problems or other issues will interrupt the query before the cursor has been populated with batchSize matching documents, then you can indeed get the documents one by one as you suggested in your answer, however you'll need to sort by { _id: 1 }
, and use noCursorTimeout()
and limit(1)
, i.e.:
var doc = db.collection.find({ "rare_field" : { $exists : true }})
.sort({ _id: 1 })
.limit(1)
.noCursorTimeout()
.next();
Then as you suggested, retrieve the next document by repeating the query while adding the condition { _id: { $gt: doc._id } }
to the query object.
Upvotes: 2
Reputation: 28179
A possible solution is to use findOne
instead of find:
var doc = collection.findOne({ "rare_field" : { $exists : true }});
and then loop to get the next one:
var doc = collection.findOne({ _id : { $gt : doc._id}, "rare_field" : { $exists : true }});
However, I'm not 100% sure that these calls must give me results ordered by _id
, and I'm not sure that single findOne
won't timeout as well.
I fear that explicitly adding sort({_id : 1})
will force the query to get all results, and then to deliver the first one.
Upvotes: 0