RLH

Reputation: 15698

Short Text Database Keys vs. Numeric Keys: When Is Either More Efficient Than the Other?

I am well aware that using an nvarchar field as a primary key or a foreign key adds some time and space overhead to the generated index in the majority (if not all) of cases.

As a general rule, using numeric keys is a good idea, but under certain common circumstances (small sets of data, for instance) it isn't a problem to use text-based keys.

However, I am wondering whether anyone can provide rigorous information on whether it is MORE efficient, or at least equally efficient, to use text rather than numeric values for database keys under certain circumstances.

Consider a case where a table contains a short list of records. For our example, we'll say we need 50 records. Each record needs an ID. You could use generic int (or even smallint) numbers (e.g. [1...50]), OR you could assign meaningful 2-character values to a char(2) field (e.g. [AL, AK, AZ, AR, ... WI]).

In the above case, we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data takes 2 bytes vs. the 4 bytes used by an int. Using a smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
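To make the size comparison concrete, here is a minimal Python sketch of the byte counts involved. The helper names are hypothetical, char(2) is assumed to use a single-byte code page, and the 2-byte length overhead for varchar is an assumption about how the engine stores variable-length data. It only models raw data size, not index or row overhead.

```python
import struct

# Fixed-width integer sizes, matching a 4-byte int and a 2-byte smallint.
INT_BYTES = struct.calcsize("<i")
SMALLINT_BYTES = struct.calcsize("<h")


def char_bytes(value):
    """Bytes for a char(n) value, assuming a single-byte code page."""
    return len(value.encode("ascii"))


def nchar_bytes(value):
    """Bytes for an nchar(n)/nvarchar value, assuming UTF-16 storage."""
    return len(value.encode("utf-16-le"))


def varchar_bytes(value):
    """Bytes for a varchar(n) value: data plus an assumed 2-byte length overhead."""
    return char_bytes(value) + 2


sizes = {
    "int": INT_BYTES,
    "smallint": SMALLINT_BYTES,
    "char(2)": char_bytes("AL"),
    "varchar(2)": varchar_bytes("AL"),
    "nchar(2)": nchar_bytes("AL"),
}
print(sizes)
```

Under these assumptions, only char(2) matches smallint in raw size; varchar(2) and nchar(2) come out no smaller than int.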

The benefit of using the text-based key over the numeric key is that the values are readable, which should make it obvious to many that my list of 50 records is likely a list of US states.

As stated, keys that are smaller than or equal in size to a comparable numeric key should be of similar efficiency. However, depending on the architecture and design of the database engine, real-world usage may yield unexpected results.

With that stated, is it ever more, equal or less efficient to use any form of text-based value as a key within SQL Server?

I don't need obsessively thorough research results (though I wouldn't mind it), but I am looking for an answer that goes beyond stating what we would expect from a database.

Definitively, how does efficiency of text-based keys compare to numeric-based keys as the size of the text key increases/decreases?

Upvotes: 3

Views: 1338

Answers (2)

Balázs

Reputation: 2939

Think in assembly to find out the answer. You stated this:

we could assume that using a char(2) field is potentially more efficient than using an int key since the char data is 2-bytes, vs. 4-bytes used with a int. Using a smallint field theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.

This isn't true, as you can't move 2 characters simultaneously in a single instruction (to my knowledge). So even though a char is smaller than a 4-byte int, you have to move the characters one by one into a register to do a comparison. To compare two instances of a 4-byte int, even though each is larger in size, you only need 1 move instruction per int (disregarding that you also need to move them out of the register back into memory).

So what happens if you use an int:

  • Move one of them into one register
  • Move the other into another
  • Do a comparison operation
  • Move to appropriate memory location depending on the comparison result

In the case of a char, however:

  • Move one of them into one register
  • Move the other into another
  • Do a comparison
    • If you are lucky, and the order can be determined, then done, and the cost is the same as that in the case of ints.
    • If they are equal, rinse and repeat using the subsequent characters until the order or equality can be determined. Obviously, this is more costly.

The point is that, at a low level, the determining factor in this case is not the data size but the number of instructions needed.
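The two step lists above can be sketched as a small Python model that counts comparison steps. This is only an illustration of the argument, with hypothetical function names: it counts abstract steps, not real CPU instructions, and an actual engine may well compare several bytes at once.

```python
def compare_ints(a, b):
    """Return (ordering, steps); a single comparison settles the ordering."""
    return ((a > b) - (a < b), 1)


def compare_chars(a, b):
    """Return (ordering, steps), comparing one character position at a time."""
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            # The first differing position decides the ordering.
            return ((x > y) - (x < y), steps)
    # Every shared position was equal, so the shorter string sorts first.
    return ((len(a) > len(b)) - (len(a) < len(b)), steps)


print(compare_ints(1, 2))         # one step
print(compare_chars("AL", "WI"))  # first character differs: one step
print(compare_chars("AL", "AK"))  # first character ties: two steps
```

In this model the char key costs the same as the int only when the first characters differ; every tie adds another step.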

Apart from the low-level stuff:

  • Yes, there might be cases where it simply doesn't matter, because the data set is small and unlikely to ever change: the symbols of the chemical elements, for example (though I am not sure whether I'd use them as PKs).
  • Generally, you use artificial PKs not for time and space considerations, but because they have nothing to do with real-life data and are therefore not subject to change. Can you imagine the name of a US state ever changing? I can. If that happened (and the abbreviation changed too, of course), you would have to update the record itself and all other records that reference it. If you use an int instead, the key has nothing to do with what happens in reality, so you only have to update the abbreviation and the state name in one place, and you can sit back assured that everything is consistent.
  • Comparing short strings is not always as trivial as comparing the numeric values of their binary representations. When you also have to consider internationalization, you need to rely on custom (or framework/platform-provided) logic to compare them. To use my language as an example, the letter 'Á' has a decimal value of 193, which is greater than the value of 66 for the letter 'B', yet in the Hungarian alphabet 'Á' precedes 'B'.
  • Using textual data rather than an artificial numeric PK can also cause some fragmentation, and write operations are likely to be slower. The reason is that an artificial, monotonically increasing numeric PK causes newly created rows to be inserted at the end of the table in all cases, thereby avoiding the need to "move stuff around to free up space in between".
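The last two bullets can be illustrated with a short Python sketch. Both parts are simplified models with hypothetical names: the alphabet is only a four-letter slice of the Hungarian one, and the "rows shifted" count is a sorted-list analogy for fragmentation, not a measurement of what a database engine actually does.

```python
import bisect

A_ACUTE = "\u00c1"  # the letter 'Á', code point 193

# Assumption: a tiny slice of the Hungarian alphabet, enough to show the idea.
HUNGARIAN_ORDER = ["A", A_ACUTE, "B", "C"]
RANK = {letter: i for i, letter in enumerate(HUNGARIAN_ORDER)}


def by_code_point(words):
    """Sort by raw code point values, as a binary comparison would."""
    return sorted(words)


def by_alphabet(words):
    """Sort by position in the (simplified) alphabet instead."""
    return sorted(words, key=lambda w: [RANK[ch] for ch in w])


def rows_shifted(keys):
    """Count existing entries that must move to keep a list sorted on insert."""
    ordered, shifted = [], 0
    for key in keys:
        pos = bisect.bisect_right(ordered, key)
        shifted += len(ordered) - pos  # entries after the insertion point
        ordered.insert(pos, key)
    return shifted


print(by_code_point([A_ACUTE, "B", "A"]))
print(by_alphabet([A_ACUTE, "B", "A"]))
print(rows_shifted([1, 2, 3, 4]))
print(rows_shifted(["WI", "AL", "AZ", "AK"]))
```

Keys that arrive in increasing order never displace anything in this model, while text keys arriving in arbitrary order do.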

Upvotes: 3

John

Reputation: 3996

In most cases, considerations driven by the business requirements (use cases) will far outweigh any performance difference between numeric and text keys. Unless you are looking at very large and/or very high-throughput systems, your choice of primary key type should be based on how the keys will be used rather than on any small performance difference you will see between the two.

Upvotes: 3
