Deterministic Surrogate Key
I have a table with an integer identity column as a surrogate key for two other columns (int and datetime). To keep the value of this key in sync across test and production environments, I had the idea to make a trigger that sets the surrogate key to some deterministic value instead of the auto-generated identity (in this case a hash of the natural key). The tradeoff, as far as I can tell, is that I introduce the risk of collisions (which can be offset by changing the surrogate column to bigint).
CREATE TRIGGER dbo.TRG_TestTable_SetID ON dbo.TestTable
INSTEAD OF INSERT
AS
BEGIN
insert into dbo.TestTable (ID, IntKey, DateKey, MoreData)
select convert(bigint, hashbytes('md5', convert(binary(4), IntKey) + convert(binary(8), DateKey))),
IntKey, DateKey, MoreData
from inserted
END
Is this a good solution from a design standpoint? Will it still perform better than using the natural composite key as the primary key?
Edit: The int in the natural key is a foreign key to another table, where it is the surrogate key for a guid and a varchar. So the "natural key" alternative on this table would be the rather ugly composite of guid, varchar, and datetime.
Answers (1)
I have used similar techniques before for similar reasons and with good success. To get the deterministic qualities that you want, you might try coercing the the composite natural key column values to strings, string-concatenating them together, and then generating an MD5 hash from that to use as your deterministic primary key.
Some considerations:
- Case-sensitivity. Unless some of your business keys are meant by design to be case-sensitive, it is a good idea to establish a convention in your system to downcase or upcase letters first as 'a' is not the same as 'A' to a hash function. This can help avoid issues if you are creating a key from possibly manually user keyed data. For example, if a user keyed in item number 'itm009876' instead of 'ITM009876', and your various source systems aren't robust enough to conform the value before storing them.
- String coercion: Make sure that you coerce values into strings in a way that makes sense and is very specific. For example, using ISO dates and date times plus time zone, or converting dates and date times to Unix timestamp integers before coercing to string
- String delimiter. Use a good string separater between the strings before concatenation, such as ';'. (E.g., string concatenation of A+CB should not be the same as AB+C)
- Store hash as binary: If possible store the MD5 hash as a 16-byte binary value on the table, and use a HEX() function to display it in a human readable format. Storing an MD5 hash as binary uses exactly half of the amount of space it would take to store a 32 byte hexadecimal string, which has advantages for performance of lookups and joins because it is both shorter and completely avoids any possible cycles wasted on special string comparison logic.
Pros
- May avoid accidental duplication of row data at times
- May avoid unnecessary round trips to single authority that must generate or retrieve serial or UUID surrogate keys.
- Single column keys are easier for end users work with.
- Single column keys are easier for downstream developers writing SQL, generating urls, etc to work with.
- MD5 is old and well established so it's very well supported as an SQL function by most DBMS, so you can use compute them there too as needed without and third party extensions.
- With MD5, collisions are extremely rare. As in more likely that your data center gets destroyed by a meteor than to experience a collision, even with hundreds of billions of rows and a single table. There is quite a bit of robust discussion about this online if you Google for one popular methodology that employs hash keys: 'data vault hash keys'.
Cons
- Collisions are of course still theoretically possible with MD5. Many organizations are still very hesitant about this. So if you must have more bytes on the hash space, and you can live with the potential performance hit during joins and index updates, you can always choose a longer SHA hash.
- Generation is complicated. You must choose and document the algorithm for generating the composite key hashes very well and communicate well with other developers in the organization. Just make sure that everyone is doing it the same way.
- because of the non-sequential nature of hashes, they can be inefficient to query in some scenarios, such as in clustered index tables. Be careful with this as some DBMS's use clustered index as the default - or may not even have any other option - such as MYSQL's InnoDB. Heap tables are generally better, such as is supported/default in PostgreSQL and Microsoft SQL Server.
(Sorry for any typos and grammar mistakes. I am writing this on my mobile phone. I'll try to come back later and clean it up.)