Reputation: 1277
I am trying to model a database that needs a very high write throughput, and reasonable read throughput. I have a distributed set of systems that are adding "event" data into the database.
Currently, the id for the event record is a Guid. I have been reading that guids don't tend to make great indexes because their random distribution means that recent data will be scattered on disk, which can lead to paging problems.
So here is the first assumption I would like to validate: I am assuming that I want to choose an _id that creates a right-balanced tree, such as an autonumber. This would be beneficial because the two most recent events would essentially sit right next to each other on disk. Is this a correct assumption?
Assuming that (1) is correct, I am trying to work out the best way to generate such an id. I know Mongo natively supports ObjectId, which is convenient for applications that are ok with tying their data to Mongo, but mine isn't. Since there are multiple systems producing data, simulating an "auto-number" field is a little problematic: Mongo doesn't support auto-numbering on the server side, so the producer would have to assign the id, which is hard if it doesn't know what the other systems are doing.
In order to solve this, what I am considering is making the _id field a compound key on { localId, producerId }, where localId is an autonumber that the producer can generate on its own, because producerId makes the pair unique. ProducerId is something I can negotiate among producers so that they each end up with a unique id.
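As a sanity check on that scheme, here is a minimal sketch in Python (the `Producer` class and `next_id` name are illustrative, not part of any driver API): each producer only needs its pre-negotiated producerId and a local counter, and the resulting compound _id values are globally unique with no coordination at write time.

```python
import itertools

class Producer:
    """One event producer with a pre-negotiated, unique producerId."""
    def __init__(self, producer_id):
        self.producer_id = producer_id
        self._counter = itertools.count(1)  # local autonumber, no coordination needed

    def next_id(self):
        # Compound _id: localId first, so recent events sort next to each other.
        return {"localId": next(self._counter), "producerId": self.producer_id}

p1, p2 = Producer(1), Producer(2)
print(p1.next_id())  # {'localId': 1, 'producerId': 1}
print(p2.next_id())  # {'localId': 1, 'producerId': 2}
print(p1.next_id())  # {'localId': 2, 'producerId': 1}
```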
So here is my next question: if my goal is to get the most recent data from all producers, then { localId, producerId } should be the preferred key ordering, since localId grows monotonically (keeping new inserts at the right edge of the tree) and producerId only forms small clusters within each localId, and I would prefer that the two most recent events stay local to each other. If I inverted that order, my reasoning for how the tree would eventually look is something like the following:
root
/ | \
p0 p1 p2
/ | \
e0..n e0..n e0..n
where p# is a producer id and e# is an event. This seems like it would fragment my index into p# clusters of data, and new events wouldn't necessarily be next to each other. My assumption is that the preferred ordering would instead look something like this (please verify):
root
/ | \
e0 e1 e2
/ | \
p0..n p0..n p0..n
which would seem to keep recent events near each other. (I know that Mongo uses B-trees for indexes; I am just trying to simplify the visual here.)
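The clustering intuition above can be checked with a toy simulation, using plain Python tuples as stand-ins for the compound keys (this models only sort order, not B-tree internals):

```python
import random

# Interleaved events from 3 producers; localId grows over time per producer.
events = [(local, prod) for local in range(1, 6) for prod in range(3)]
random.shuffle(events)  # arrival order doesn't affect index order

# Index order for { localId: 1, producerId: 1 }: plain lexicographic sort.
by_local_first = sorted(events)
# Index order for { producerId: 1, localId: 1 }: producer clusters first.
by_producer_first = sorted(events, key=lambda e: (e[1], e[0]))

# The rightmost keys under localId-first are the newest events from ALL producers:
print(by_local_first[-3:])     # [(5, 0), (5, 1), (5, 2)]
# Under producerId-first, the right edge only holds ONE producer's newest events:
print(by_producer_first[-3:])  # [(3, 2), (4, 2), (5, 2)]
```

So with localId leading, recent inserts from every producer land together at the right edge of the key space; with producerId leading, each producer appends into its own cluster in the middle of the index.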
The only caveat to { localId, producerId } that I can see is that a common user query would be to list the most recent events by producer, which { producerId, localId } would handle much better. To make this query work with { localId, producerId }, I am thinking I will also need to add producerId as a separate field on the document and index that.
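That caveat shows up in the same tuple model (again just sorted Python tuples, not MongoDB itself): under { localId, producerId } ordering, one producer's events are scattered through the index, while a { producerId, localId } ordering makes them a single contiguous range that a range scan could serve.

```python
# Keys in { localId: 1, producerId: 1 } order: 4 events each from 3 producers.
keys = sorted((local, prod) for local in range(4) for prod in range(3))

# Positions of producer 1's events: not contiguous, so this index cannot
# answer "events for producer 1" with a simple range scan.
positions = [i for i, (local, prod) in enumerate(keys) if prod == 1]
print(positions)  # [1, 4, 7, 10] -- scattered through the index

# The same keys ordered { producerId: 1, localId: 1 }: one contiguous run.
by_producer = sorted(keys, key=lambda e: (e[1], e[0]))
positions2 = [i for i, (local, prod) in enumerate(by_producer) if prod == 1]
print(positions2)  # [4, 5, 6, 7] -- a single range
```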
To be explicit about what my question here really is, I want to know if I am thinking about this problem correctly, or if there is an obviously better way to approach this.
Thanks
Upvotes: 0
Views: 926
Reputation: 461
To answer your question: a compound index like {a, b} will end in scatter queries if you query only by b and then sort by a, but it will still use the index for the sort.
If you use a document instead of an ObjectId as _id, the _id field will still be indexed, but only as a whole document: it is not a compound index!
Example:
Given these documents in collection 'a' and no additional index:
{ "_id" : { "e" : 1, "p" : 1 } }
{ "_id" : { "e" : 1, "p" : 2 } }
{ "_id" : { "e" : 2, "p" : 1 } }
{ "_id" : { "e" : 1, "p" : 3 } }
{ "_id" : { "e" : 2, "p" : 3 } }
{ "_id" : { "e" : 2, "p" : 2 } }
{ "_id" : { "e" : 3, "p" : 1 } }
{ "_id" : { "e" : 3, "p" : 2 } }
{ "_id" : { "e" : 3, "p" : 3 } }
a query like this:
db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()
will NOT use an index:
{
"cursor" : "BasicCursor",
"nscanned" : 9,
"nscannedObjects" : 9,
"n" : 3,
"scanAndOrder" : true,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
even though the documents are indexed (only the whole _id value is indexed, not its individual fields).
If you create an index like this:
db.a.ensureIndex({'_id.e' : 1, '_id.p' : 1})
and then query again:
db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()
{
"cursor" : "BtreeCursor _id.e_1__id.p_1",
"nscanned" : 9,
"nscannedObjects" : 3,
"n" : 3,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"_id.e" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"_id.p" : [
[
2,
2
]
]
}
}
it will walk the index (nscanned: 9) because of the sort and then fetch only the matching objects (nscannedObjects: 3), which is better than sorting without the index (where nscanned and nscannedObjects would both be 9).
So for high write throughput (over 15k writes a sec) you would probably shard. Both indexes would guarantee uniqueness if the unique option is set. But only a compound shard key will give you direct (targeted) queries and avoid scatter-gather.
Using {'_id.e' : 1, '_id.p' : 1} as a shard key will route all "_id.e" queries directly, but not "_id.p" (without 'e') queries; those are sent to every host and end in index lookups there, which could still be fast (depending on the network, etc.). If you want to cluster these queries by "p", you have to put '_id.p' as the first part of the compound key, like so:
{'_id.p' : 1, '_id.e' : 1}
So all "p" queries are direct queries. But yes, this would scatter recent events across the cluster. So a separate index using the time based key might speed up those scatter queries.
I would generate some sample data, play around with it in a two-shard setup on a dev system, and use .explain() to choose the shard key and indexes.
Upvotes: 1