Reputation: 11686
Why does MongoDB use ObjectID instead of a string version of it?
Everything is just as possible with the string version of it (date, ordering, etc) and it makes it easier to transport to the client. Is it just that it's smaller?
Thanks for any info!
Upvotes: 2
Views: 648
Reputation: 52679
I remember someone who used to code conditionals with strings, e.g. if x == "red" then... and so on. They said it was clearer and easier to understand, yet strangely it performed quite badly.
It's times like this I think everyone should be taught C programming so they'd understand what happens under the covers when you write programs.
Edit: performance in these cases is about the size of the item being compared. If you have a value that fits into a single fetch, the CPU can do the compare in 1 instruction, but if you have something that requires multiple fetches, it has to compare each piece in turn. If a string is used, the CPU has to compare each character to see if they are all the same (or stop when it gets to the first difference). That's massively slower than a single comparison. An ID value like the one used in MongoDB is stored as raw bytes (12 of them, built from a timestamp and a few other fields rather than being a true hash), and on most CPUs nowadays a 64-bit value can be compared in 1 instruction. It's all about how the memory is read and how the CPU can process it. CPUs are ultimately very simple things; they just do a few simple things very fast.
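To make the "compare each character, stop at the first difference" point concrete, here is a minimal sketch in Python. The helper name count_byte_compares is made up for illustration, and it only counts steps; it is not how any real runtime or MongoDB implements comparison, and it ignores length checks.

```python
def count_byte_compares(a: bytes, b: bytes) -> int:
    """Count how many byte-by-byte checks a naive comparison performs.

    Hypothetical helper for illustration only: it walks both inputs in
    step and stops at the first pair of bytes that differ.
    """
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            break  # first difference found, no need to look further
    return steps


# Equal strings force a check of every byte; unequal ones can stop early.
same = count_byte_compares(b"red", b"red")
early = count_byte_compares(b"red", b"blue")
```

The cost grows with the length of the common prefix, which is why a fixed-size value that fits in one or two machine words is cheaper to compare.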
Mongo's ID is 12 bytes, which is unfortunately 96 bits, so it doesn't fit into a single 64-bit CPU value and can end up as 2 compares. This is still much better than a string: even a 12-byte string needs up to 12 byte-by-byte compares, and the hex string form of an ObjectID is 24 characters long. I also think modern CPUs have 128-bit registers for certain comparison operations, so if the CPU has the right SSE registers, it'll be a single compare op anyway.
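A rough sketch of the size argument in Python, assuming an example ID value (the hex string below is just a sample, not taken from any real database). It shows the 24-character hex form shrinking to 12 raw bytes, and those 12 bytes being viewed as one 64-bit and one 32-bit integer, which is the "2 compares" case. The function names are made up for illustration.

```python
import struct

# Sample value for illustration: 24 hex characters in string form.
hex_id = "507f1f77bcf86cd799439011"
raw_id = bytes.fromhex(hex_id)  # the same ID as 12 raw bytes


def as_two_words(raw: bytes) -> tuple:
    """View 12 bytes as a 64-bit and a 32-bit big-endian unsigned integer."""
    return struct.unpack(">QI", raw)


def ids_equal(a: bytes, b: bytes) -> bool:
    """Compare two 12-byte IDs as two integers instead of byte by byte."""
    return as_two_words(a) == as_two_words(b)
```

Python itself won't compile this down to two CPU instructions, of course; the point is only that the binary form is half the size of the hex string and splits naturally into two machine-sized words.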
I think they chose to pack all those values in because they wanted the time to be included in the ID, along with process and machine ids to keep it unique in case 2 machines happened to write simultaneously.
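As a sketch of how those packed values can be pulled apart, assuming the layout described here (4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter): newer MongoDB versions replace the machine and process fields with a 5-byte random value, so treat the field split below as an assumption. The function name and the sample ID are made up for illustration.

```python
def split_object_id(hex_id: str) -> dict:
    """Split a 24-character hex ID into its assumed component fields.

    Assumes the older layout: 4-byte timestamp, 3-byte machine id,
    2-byte process id, 3-byte counter, all big-endian.
    """
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes (24 hex characters)")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),  # seconds since the Unix epoch
        "machine": raw[4:7].hex(),
        "pid": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Sample value for illustration only.
parts = split_object_id("507f1f77bcf86cd799439011")
```

Because the timestamp sits in the leading bytes, sorting the raw bytes (or the hex strings) roughly sorts by creation time, which is the ordering property the question mentions.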
Also unfortunately, they use 4 bytes for the time component, which runs into the 2038 limit if it is read as a signed 32-bit number of seconds (read as unsigned, it lasts until 2106). Either way, the ID will still be unique for practical purposes.
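The two limits can be checked with a little arithmetic. This is a minimal sketch; the function name is made up, and whether a given driver reads the field as signed or unsigned is something to verify for your own setup.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_representable(bits: int, signed: bool) -> datetime:
    """Return the latest UTC time a seconds-since-epoch counter of this width can hold."""
    max_seconds = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return EPOCH + timedelta(seconds=max_seconds)


signed_limit = last_representable(32, signed=True)     # falls in January 2038
unsigned_limit = last_representable(32, signed=False)  # falls in 2106
```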
It might be possible to do faster string comparisons, using SSE4 for example, but those require the string to meet certain restrictions, e.g. it must be aligned properly, have a length that is a multiple of 16 bytes, and not cross page boundaries.
Upvotes: 4