Reputation: 2135
What are the benefits of having an index and a unique constraint on a primary key field that is a UUID? It seems that if I have 25M records and I have to insert a new record, all 25M records have to be searched and checked that they don't have the same UUID with or without an index. Finding that record is also troublesome as uuid's can't be sorted. Am I missing something?
Upvotes: 0
Views: 4237
Reputation: 108390
Yes.
UUID values can be sorted. They might not be sorted in a sequence that you find particularly desirable. But UUIDs are data values. They can be compared (are they equal, is one less than another), and therefore they can be sorted.
Declaring a PRIMARY KEY
effectively creates a UNIQUE index. With some storage engines (e.g. InnoDB) the PRIMARY KEY is the cluster key of the table. With other storage engines (e.g. MyISAM), the table is stored as a heap and the PRIMARY KEY is essentially the same as declaring NOT NULL constraints and adding a UNIQUE INDEX.
Yes, when inserting a row into a table, the storage engine must ensure that no PRIMARY KEY or UNIQUE KEY constraints are violated... the values on the new row being inserted don't duplicate values that are already stored.
And that is equivalent (theoretically) to a check of all 25M rows. But because there's an available index structure, the storage engine doesn't need to check all of the individual rows. It uses the index instead.
Because the index is stored "in order", there are vast swaths of blocks containing rows that don't need to be checked. They don't need to be checked because it's impossible for a row with those particular values of the key columns to be stored in those blocks. The storage engine very efficiently identifies the one block where a row with a "duplicate" key value exists, or would exist.
FOLLOWUP
The answer above mostly refers to MySQL (one of the tags in the question). In terms of PostgreSQL, I believe the points are valid.
As far as using UUID values as the PRIMARY KEY for table, there can be some performance drawbacks as compared to some other choices. Two primary concerns: the space required for storing a UUID, and UUID values aren't generated/inserted sequentially.
A UUID is 128 bits, which is 16 bytes. But that's often converted to "human readable form (?)" of 36 characters. Storing the UUID as 36 characters takes up a lot more room than a simple integer. When a CHAR(36) is used as the PRIMARY KEY, that doesn't get stored in just the primary key index, but also gets stored as the "row pointer" in all of the secondary indexes. That translates into fewer keys per block, which in turn means more blocks in the indexes.
The other issue, with new values being inserted not just at the back of the index, but all over in the index, leads to block splits and fragmentation. We don't have to be overly concerned about all that, because the database handles it for us. But using a UUID as a PRIMARY KEY can translate into measurably "slower performance" (vs using ascending integer values), at least in the test lab.
In terms of "what's the benefit" of adding a secondary index with the PRIMARY KEY as the leading column. In general, there is no benefit.
(I'm not going to rule out the corner cases where having such an index might be beneficial. I'd expect those corner case to involve very long rows in an index organized table, and some specific SQL statements that could make efficient use of a secondary index. But that performance benefit would come at a cost, additional blocks (memory and disk i/o) and the extra work to maintain the secondary index.)
Upvotes: 5