Reputation: 653
It is better to use primary key when design tables.
But when designing a primary key, is it needed to set auto_increment?
What's the benefit?
I heard that can keep b-trees stable, but why?
If table has a unique column, which is better: set the unique column as primary key or add a new column id
as auto_increment primary key?
Upvotes: 6
Views: 13553
Reputation: 562260
I want to know when design a primary key, it is needed to setting auto_increment?
No, it's not strictly necessary. There are cases when a natural key is fine.
If done, what's the benefit?
Advantages of using an auto-increment surrogate key:
Advantages of using a natural key:
shoes
referencing colors
would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors
, then that value would already be part of the shoes
table.Other cases when a surrogate auto-increment key is not needed:
I listen, that can keep b-tree's stable, but i don't know why?
Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.
There's an animated example here: http://www.bluerwhite.org/btree/
Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.
Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be in the case of an index B-tree has millions of entries), and filling the parent node can itself be an overflow, and force the splitting operation to continue up the the higher level in the tree. This can continue all the way to the very top of the tree if all the ancestor nodes to the top of the tree were already filled.
As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.
Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @ BrankoDimitrijevic points out in a comment below, this does not make it less likely that they'll cause such laborious node-splitting and restructuring to the index. But the B-tree implementation code can optimize for this case in other ways, and some do.
If table has a unique column, which's better that set the unique column as primary key or add a new column 'id' as auto_increment primary key?
If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.
Upvotes: 11
Reputation: 2021
Thanks to @Branko Dimitrijevic for opening the topic of relational database primary keys (PK's) by describing the role of SURROGATE KEYS
and getting to the center of the discussion. Surrogate keys are by definition devoid of any intrinsic meaning aside from their uniqueness between their values in each record of their table.
Thanks also to @Mattias Åslund for your additional wisdom:
Mind you most things that "will never change" tends to do so anyway given enough time, like someones social security number...
I add that even if the value chosen as "unchangeable" really does not change, it is also just as likely for the rules of the supported business or organization itself to drift and change over time in ways that may affect the core assumptions of given design.
A useful discussion on the integration of demographic and biometric key values for tracking individuals can be found in this section on Choosing an Appropriate Key for New Databases put out by the Computer Professionals for Social Responsibility.
I plan to approach the comments on this post with a discussion around a specific example design to explain the kinds of things that can go wrong with assigning a Primary Key that isn't a surrogate key. Many of these assumptions are taken from observations from real applications. They are remembered well because of the complexity their design introduced into other business processes as other systems and data sources became dependent on their assumptions.
The design and sample data is below, borrowed loosely from Oracle's infamous Scott/TIGER database design.
MySQL 5.5.32 Schema Setup:
CREATE TABLE employee
(
fake_ssn varchar(15) primary key,
last_name varchar(40),
first_name varchar(40),
dept_id varchar(15),
hire_date date,
salary int,
email varchar(100)
);
INSERT INTO employee
(fake_ssn, last_name, first_name, dept_id, hire_date, salary,
email)
VALUES
('130-60-0101', 'MARLOWE', 'JACOB', '1200-05', date('2009/01/25'),
8000, '[email protected]'),
('967-22-5025', 'CRACHITT', 'BOB', '1200-05', date('2010/02/05'),
500, '[email protected]'),
('040-36-5555', 'PERRY', 'VICTORIA', '1200-02', date('2011/05/25'),
2700, '[email protected]'),
('203-89-1010', 'STEVENS', 'KEVIN', '2955-03', date('2007/04/25'),
1800, '[email protected]'),
('409-99-1111', 'MCLANE', 'JONATHAN', '2955-03', date('2009/03/02'),
4200, '[email protected]');
CREATE TABLE department
(
dept_id varchar(15) primary key,
dept_manager varchar(40),
dept_title varchar(40)
);
INSERT INTO department
(dept_id, dept_manager, dept_title)
VALUES
('1200-05', 'MARLOWE', 'FINANCE'),
('1200-02', null, 'HR'),
('2955-03', 'JOHNM', 'MARKETING');
COMMIT;
Query 1:
SELECT fake_ssn, last_name, first_name, dept_id, hire_date,
salary, email
FROM employee
| FAKE_SSN | LAST_NAME | FIRST_NAME | DEPT_ID | HIRE_DATE | SALARY | EMAIL |
|-------------|-----------|------------|---------|---------------------------------|--------|--------------------------------|
| 040-36-5555 | PERRY | VICTORIA | 1200-02 | May, 25 2011 00:00:00+0000 | 2700 | [email protected] |
| 130-60-0101 | MARLOWE | JACOB | 1200-05 | January, 25 2009 00:00:00+0000 | 8000 | [email protected] |
| 203-89-1010 | STEVENS | KEVIN | 2955-03 | April, 25 2007 00:00:00+0000 | 1800 | [email protected] |
| 409-99-1111 | MCLANE | JONATHAN | 2955-03 | March, 02 2009 00:00:00+0000 | 4200 | [email protected] |
| 967-22-5025 | CRACHITT | BOB | 1200-05 | February, 05 2010 00:00:00+0000 | 500 | [email protected] |
Query 2:
SELECT dept_id, dept_manager, dept_title
FROM department
| DEPT_ID | DEPT_MANAGER | DEPT_TITLE |
|---------|--------------|------------|
| 1200-02 | (null) | HR |
| 1200-05 | MARLOWE | FINANCE |
| 2955-03 | JOHNM | MARKETING |
(The FAKE
name is just a reminder that these are all randomly generated values.) Although this is often a popular "unqiue" value in personnel records and databases, according to the U.S. Social Security Administration, this value is not unique. It is also problematic because this value and its transfer is subject to strict regulation because of recently passed privacy laws.
Even with the additional combinations created by including the middle initial, somehow there are still just too many people with the same name. A look at what the Social Security Administration has to say about the registered names of babies born in 2012:
Two decades from now, when the JACOB's and SOPHIA's of 2012 graduate from school, they will flood into the workforce beside thousands of others like them...
Name changes by marriage or legal reasons are also threats to the referential integrity of database records relying on their values as business key values.
Some companies will try to derive keys from other values to make SMART KEYS
. These types of keys when observed in practice are NOT smart at all. The values in the example: 1200-02
, 1200-05
, 2955-03
were intended to resemble a "smart key". The first value might be a street address or building number for a company campus or multiple-location business. The second value, ("-02", "-03", "-05") might be the floor of the building where the department is located.
Changing buildings, moving a department or relocating a business completely would render this location dependency of the DEPARTMENT ID
useless.
This one is subtle, but there is a hole in this relational connection. A MANAGER
is ALSO an employee, which makes the relational join between EMPLOYEE
and DEPARTMENT
a circular one:
MANAGER
(from DEPARTMENT
) a foreign key constraint on the EMPLOYEE
table, orDEPT_ID
(FROM EMPLOYEE
) a foreign key constraint on the DEPARTMENT
table?If you give up the Foreign Key constraint between MANAGER
and some key column on EMPLOYEE
(LAST_NAME
or FIRST_NAME
+ LAST_NAME
), you risk non uniform values for MANAGER
.
... Looking at
Query of The DEPARTMENT
Table:
| DEPT_ID | DEPT_MANAGER | DEPT_TITLE |
|---------|--------------|------------|
| 1200-02 | (null) | HR |
| 1200-05 | MARLOWE | FINANCE |
| 2955-03 | JOHNM | MARKETING |
The misplacement of DEPT_MANAGER
within the DEPARTMENT
table, as there are three different ways the name of the Department Manager are represented: none (null), ALL-CAPS Last Name, ALL-CAPS First Name, Last Initial.
An important lesson to take from this posting is that making a key more than a key either by integrating derived values, making values based on assumptions about business rules tie down the flexibility of a database design because if the business rules change, so will the values such as Primary Keys or Joining Key values.
As the developer or maintainer of the business application, you (or your team) are better able to support the prevailing business conditions if you have taken control and own the parts that represent the internal structure of the business application itself. The primary key might never actually present itself in a customer or user-facing situation, but it should be protected so that the relations it represents does not change with time.
Image Credit from the popular Baby Names page of 2012:
http://www.ssa.gov/OACT/babynames/#ht=0
Upvotes: -1
Reputation: 52107
I know it is better to use primary key when design tables.
In fact, a key-less table is a multiset (since it allows duplicates), and therefore not strictly a relation (which is a set), and therefore your database would not really be "relational".
Please note that "primary" (PRIMARY KEY) and "alternate" (UNIQUE constraint) keys are logically equivalent.
But I want to know when design a primary key, it is needed to setting auto_increment?
You are actually asking multiple questions:
The answer to (1) is "almost always". There are some very rare cases when data is not "important" and you might skip it for performance reasons, but that's exceedingly rare.
The answer to (2) is "it depends" - main pros and cons can be found here.
The answer to (3) depends on whether you need to generate keys independently from the database (e.g. while disconnected, or while connected to a different database). If yes, you could use GUIDs (which obviously cannot be auto-incremented, but can be uniquely generated in isolation). If no, then you can just use integers - they are more compact and generally faster.
And finally, if you reached (4), then you'll almost certainly want to make it auto-incremented, for the reasons discussed below.
If done, what's the benefit?
SELECT MAX(ID) + 1 FROM ...
, there is no guarantee some other client will not try to do the same thing at the same time, and end-up with the same result (subsequently causing a key violation).If table has a unique column, which's better that set the unique column as primary key or add a new column 'id' as auto_increment primary key?
If the attribute is intrinsically unique at the "logical level", then the corresponding table column must be made unique (through PRIMARY KEY or UNIQUE constraint), whether you later decide to add a surrogate key or not.
Upvotes: 6
Reputation: 3907
To have an auto-increment PK makes it easy to create a key that never needs to change, which in turn makes it easy to reference in other tables.
If your data is such that you have natural columns that are unique and can never change you can use them just as well. Mind you most things that "will never change" tends to do so anyway given enough time, like someones social security number...
For simplicity I always use auto-increment (identity) columns for PK.
Upvotes: 1