Reputation: 3483
I'd like to perform efficient operations of this form with mongodb:
db.getCollection('x').find({a:{$ne:null}})
My understanding is that an index on a
will not include documents which are missing the field a
. So queries of the form {a:{$ne:null}}
need to scan for those documents (i.e. can't rely solely on the index to find all the matching documents).
I'm considering a mongo feature request (if one hasn't been submitted already) to allow indices to optionally include documents with missing values. I'm wondering:
I know there are a lot of questions here about indices and "null" (the null
value versus a missing value, etc), but I spent a bit of time and couldn't find a direct answer to this question.
As a real example, I have a collection with ~80 million documents. About 1,000 of those documents are missing the field a
. I'd like to be able to iterate over those documents that are missing a
(in any order). One workaround is to make sure they're never missing a
and just set it to -1 or some other particular value. That seems a bit silly to me - there should be a way to have mongo do that for me under the hood.
Upvotes: 1
Views: 756
Reputation: 14520
There are two relevant features of an index that permit fast lookups (compared to performing a collection scan):
If you have values a
and b
, you can say that a
comes before b
, lexicographically. If you have documents {a: 2, b: 5}
and {b: 4, a: 3}
, there is no single ordering of these documents that would satisfy typical queries. For example, if you want all of the a
values ordered then you might expect 2, 3
, but if you want the b
values you might expect 4, 5
- requiring document order to be reversed.
When a database stores the index on disk, the values are stored in index order (whatever that might be for the particular index, e.g. collation affects this). Generally there is no single ordering that is usable for collection documents overall, hence collection documents are unordered.
When you query by index, you take the value being searched and essentially execute a binary search using the index because the data in the index is sorted.
The second reason to use the index is, if you are scanning the collection, for each document, the entire document typically needs to be retrieved from disk and skipped over. If you have a 100 GB collection and you are doing a scan you might need to skip over 100 GB of data. If the same collection has a 100 MB index on some field (because the index only stores the values in that field and not the entire document's worth of data), and the database performs a complete index scan, it only has to traverse 100 MB of data.
Now, to your question about storing lack of values in an index.
From the index's standpoint, the "lack of a value" in different documents is the same value. You lose the ability to do binary search when all of your values are identical. So if you are looking for that "lack of value" document, the index will give you back all of the documents in the collection that lack the value, and then you have to do a scan through them anyway to filter by whatever other conditions you have. Since this generally produces bad selectivity the databases don't bother with indexes and do collection scans in the first place.
And, most likely you want some other fields out of your query, not the field that doesn't have a value. So now you want the index to store complete documents, defeating the compactness idea.
Upvotes: 1