Reputation: 6668
I am using SQL Server and creating a table (example is at the very bottom of this question). However I am having some issues understanding how primary keys actually work & how to use them properly.
So I know a primary key ensures all the rows in a table are unique and that a primary key can't be null. I also read this page index basics - simple talk on indices and how indices are organised in a b-tree structure.
So in my table for a row to have a unique value I would have to use the first 3 columns (UploadDate, SecID & FundCode of types datetime, varchar(12) & varchar(6)). Only select queries will be used on this table & the where clause will be using one or more of the the three fields just mentioned.
So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries? So I take it the primary key creates an index or some sort with the value of of your column (or in my case 3 columns), I don't see how this will help as my value will be a datetime & two bits of text?
Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?
type column name
------------- ------------
datetime UploadDate
varchar(12) SecID
varchar(6) FundCode
varchar(100) Name
float Price
float Nominal
int SourceCode
datetime PriceDate
example of a some rows
UploadDate SecID FundCode Name Price Nominal SourceCode PriceDate
2015-08-20 A045 ABCVPL Joe 1.3434 1000.33 3
2015-08-20 A563 ABCVPL Bob 1.5961 10.33 3
2015-08-20 A045 DEFGHJ Joe 1.3434 856.41 3
2015-08-20 XC45 PLMNOI Pip 2.3654 25.52 3
2015-08-20 KMM5 ABCVPL Nit 6.9565 1532 3
2015-08-21 A045 ABCVPL Joe 4.3434 1112 3
2015-08-21 GH45 DEFGHJ Joe 3.3434 16532 3
2015-08-21 PL34 DEFGHJ Joe 7.3434 635 3
2015-08-21 ER33 ABCVPL Joe 8.3434 6320 3
Upvotes: 2
Views: 2990
Reputation: 69769
The question appears to confuse two different concepts. The first is a primary key, the second is a clustered index. The first is a logical concept, the latter is a physical concept and refers to how the data is actually stored. There are cases when it is useful to decouple the primary key and clustering key, but for the most part they are one and the same, and by default your primary key will be your clustering key. It is an important distinction nonetheless.
I think people can (and have) argued until the cows come home about whether to use a natural or surrogate primary key. I won't touch upon this too much, but the basic is what you are suggesting when using the 3 columns that would define a unique row is a natural key (i.e. already exists in your data), and another approach is to use an identity column, which will give each row a unique value, this is a surrogate key since it has no actual meaning other than to uniquely identify your row.
So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries?
It doesn't, having an index might help depending on your queries. Given the right index the database engine is able to navigate directly to the required data.
Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?
This is a good candidate for the clustering key. According to the queen of indexing Kimberly Tripp a clustered index should be:
You have already ticked the unique box, with your 3 columns, this is not that narrow, but not wide by any means. The second I can't answer, if UploadDate
is a default value that is entered at the time of creation then you could have an ever increasing pattern, and I have no idea if your three columns are static or they could change. If either of these last two are true then you should be using a surrogate identity column to cluster on regardless.
I would personally probably have eliminated this as a candidate for a clustering key based on the with (26 bytes). You have an extra 4 bytes per row in the clustered index, but you save 22 bytes per row in all subsequent indexes.
So in a table of 10,000,000 rows you gain an extra 38.1 MB due to the identity column, however you gain 209.8MB for each non clustered index, although disk space is cheap, it is not a reason to waste it unnecessarily. It is not just all indexes that gain these 22 bytes, it is also all referencing tables with foreign keys, which leads to my next point, convenience when writing queries. Do you really want to have to type out this join each time you refer to the key:
SELECT *
FROM Parent AS p
INNER JOIN Child AS c
ON c.UploadDate = p.UploadDate
AND c.SecID = p.SecID
AND c.FundCode = p.FundCode;
Or would you rather simply write:
SELECT *
FROM Parent AS p
INNER JOIN Child AS c
ON c.ParentID = p.ParentID;
For this reason, even if I have decided that what is logically a primary key is not a good candidate for a clustering key, I tend to still make the clustering key the primary key for ease of reference in relation tables. For example I have an external API that sends me order details in XML:
<orders>
<order ID="12B47EF2-B9F5-4CD7-811F-2E7EC1A67E59">
<orderdetail>
<product>Some Product</product>
<quantity>1</quantity</quantity>
</orderdetail>
<orderdetail>
<product>Some Other Product</product>
<quantity>2</quantity</quantity>
</orderdetail>
</order>
<order ID="3A819217-49CA-4B4C-8AD5-CAD297FCA3F3">
<etc />
</order>
</orders>
If I was setting up my tables to store this, although the ID from XML would be the logical primary key for my Orders
table, it would be a terrible clustering key, so I would add a surrogate identity field to avoid the fragmentation associated with clustering on a GUID:
CREATE TABLE dbo.Orders
(
OrderID INT IDENTITY NOT NULL,
SupplierOrderID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT PK_Orders__SupplierOrderID PRIMARY KEY NONCLUSTERED (SupplierOrderID)
);
CREATE UNIQUE CLUSTERED INDEX UQ_Orders__OrderID ON dbo.Orders (OrderID);
The GUID is still the primary key, so my order detail table can refer to this, but I generally think that if I don't consider the key good enough to cluster on, why would I then put the same key into another table as a foreign key. I have already defined a more narrow key in OrderID
, why not just use this as my foreign key in order details, and save myself 12 bytes. So I would end up with:
CREATE TABLE dbo.Orders
(
OrderID INT IDENTITY NOT NULL,
SupplierOrderID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT PK_Orders__OrderID PRIMARY KEY CLUSTERED (OrderID)
);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Orders__SupplierOrderID ON dbo.Orders (SupplierOrderID);
As with everything, there are exceptions, and there are cases where I would choose the 3 columns as a composite (clustered) primary key, and this woule be if I knew there would be no child tables, and that all my select queries would still require me to select UploadedDate
, SecID
, and FundCode
. If you had an index on Name
for example:
CREATE NONCLUSTERED INDEX IX_YourTable__Name ON dbo.YourTable (Name);
SELECT UploadDate, SecID, FundCode, Name
FROM dbo.YourTable
WHERE Name = 'Bob';
If you have a surrogate key, then you will seek through the name index and find Bob at row 2 only, then lookup row 2 on your clustered index to get the corresponding values for UploadedDate
, SecID
, and FundCode
. If these three columns are your clustering key, then you remove the need for the lookup since you already have the data in the name
index. The extra 209.8MB on each index could be worth it to avoid these lookup operations.
In summary (as usual), it depends - it depends on both your personal preference (I believe Aaron Bertrand and Joe Celko are still at loggerheads on the natural vs surrogate key debate, and if these two great minds can't agree, then the answer really has to be personal preference), and also your exact situation, in some situations you will want a composite primary key, in some instances you will want a surrogate key, in some instances you will want your primary key and your clustering key to be the same thing, in other instances you won't.
Upvotes: 3
Reputation: 15091
The primary key itself does not speed up queries which do not depend upon it. Other DBMSs may not rely on existence of primary key (e.g. Oracle's default is heap table).
It's just MSSQL "feature" - it "likes" primary key. It was optimized for it (for that tree-structure actually). Other engine may not benefit of simple autoincrement field.
Upvotes: -1
Reputation: 67311
As your UploadDate carries an implicit sorting it could be wise to have a clustered multi-column Primary Key. Any other key will be able to use this as look up key, because the clustered key does implicitly cover all columns. Thus you should create this first...
An additional int key with random numbers should NOT be clustered (it will be extremely fragmented!).
Well, many people say, that (in almost all cases) the PK should not carry any information. This really depends on you needs.
Upvotes: 1