mHelpMe
mHelpMe

Reputation: 6668

understanding how a primary key works & how to use it

I am using SQL Server and creating a table (example is at the very bottom of this question). However I am having some issues understanding how primary keys actually work & how to use them properly.

So I know a primary key ensures all the rows in a table are unique and that a primary key can't be null. I also read this page index basics - simple talk on indices and how indices are organised in a b-tree structure.

So in my table for a row to have a unique value I would have to use the first 3 columns (UploadDate, SecID & FundCode of types datetime, varchar(12) & varchar(6)). Only select queries will be used on this table & the where clause will be using one or more of the the three fields just mentioned.

So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries? So I take it the primary key creates an index or some sort with the value of of your column (or in my case 3 columns), I don't see how this will help as my value will be a datetime & two bits of text?

Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?

b tree

type             column name
-------------    ------------
datetime         UploadDate
varchar(12)      SecID
varchar(6)       FundCode
varchar(100)     Name
float            Price
float            Nominal
int              SourceCode
datetime         PriceDate

example of a some rows

UploadDate   SecID    FundCode   Name   Price   Nominal   SourceCode   PriceDate
2015-08-20   A045     ABCVPL     Joe    1.3434  1000.33   3
2015-08-20   A563     ABCVPL     Bob    1.5961  10.33     3
2015-08-20   A045     DEFGHJ     Joe    1.3434  856.41    3
2015-08-20   XC45     PLMNOI     Pip    2.3654  25.52     3
2015-08-20   KMM5     ABCVPL     Nit    6.9565  1532      3
2015-08-21   A045     ABCVPL     Joe    4.3434  1112      3
2015-08-21   GH45     DEFGHJ     Joe    3.3434  16532     3
2015-08-21   PL34     DEFGHJ     Joe    7.3434  635       3
2015-08-21   ER33     ABCVPL     Joe    8.3434  6320      3

Upvotes: 2

Views: 2990

Answers (3)

GarethD
GarethD

Reputation: 69769

The question appears to confuse two different concepts. The first is a primary key, the second is a clustered index. The first is a logical concept, the latter is a physical concept and refers to how the data is actually stored. There are cases when it is useful to decouple the primary key and clustering key, but for the most part they are one and the same, and by default your primary key will be your clustering key. It is an important distinction nonetheless.

I think people can (and have) argued until the cows come home about whether to use a natural or surrogate primary key. I won't touch upon this too much, but the basic is what you are suggesting when using the 3 columns that would define a unique row is a natural key (i.e. already exists in your data), and another approach is to use an identity column, which will give each row a unique value, this is a surrogate key since it has no actual meaning other than to uniquely identify your row.

So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries?

It doesn't, having an index might help depending on your queries. Given the right index the database engine is able to navigate directly to the required data.

Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?

This is a good candidate for the clustering key. According to the queen of indexing Kimberly Tripp a clustered index should be:

  • Unique
  • Narrow
  • Static
  • Ever increasing pattern

You have already ticked the unique box, with your 3 columns, this is not that narrow, but not wide by any means. The second I can't answer, if UploadDate is a default value that is entered at the time of creation then you could have an ever increasing pattern, and I have no idea if your three columns are static or they could change. If either of these last two are true then you should be using a surrogate identity column to cluster on regardless.

I would personally probably have eliminated this as a candidate for a clustering key based on the with (26 bytes). You have an extra 4 bytes per row in the clustered index, but you save 22 bytes per row in all subsequent indexes.

So in a table of 10,000,000 rows you gain an extra 38.1 MB due to the identity column, however you gain 209.8MB for each non clustered index, although disk space is cheap, it is not a reason to waste it unnecessarily. It is not just all indexes that gain these 22 bytes, it is also all referencing tables with foreign keys, which leads to my next point, convenience when writing queries. Do you really want to have to type out this join each time you refer to the key:

SELECT  *
FROM    Parent AS p
        INNER JOIN Child AS c
            ON c.UploadDate = p.UploadDate
            AND c.SecID = p.SecID
            AND c.FundCode = p.FundCode;

Or would you rather simply write:

SELECT  *
FROM    Parent AS p
        INNER JOIN Child AS c
            ON c.ParentID = p.ParentID;

For this reason, even if I have decided that what is logically a primary key is not a good candidate for a clustering key, I tend to still make the clustering key the primary key for ease of reference in relation tables. For example I have an external API that sends me order details in XML:

<orders>
    <order ID="12B47EF2-B9F5-4CD7-811F-2E7EC1A67E59">
        <orderdetail>
            <product>Some Product</product>
            <quantity>1</quantity</quantity>
        </orderdetail>
        <orderdetail>
            <product>Some Other Product</product>
            <quantity>2</quantity</quantity>
        </orderdetail>
    </order>
    <order ID="3A819217-49CA-4B4C-8AD5-CAD297FCA3F3">
        <etc />
    </order>
</orders>

If I was setting up my tables to store this, although the ID from XML would be the logical primary key for my Orders table, it would be a terrible clustering key, so I would add a surrogate identity field to avoid the fragmentation associated with clustering on a GUID:

CREATE TABLE dbo.Orders
(
        OrderID INT IDENTITY NOT NULL,
        SupplierOrderID UNIQUEIDENTIFIER NOT NULL,

    CONSTRAINT PK_Orders__SupplierOrderID PRIMARY KEY NONCLUSTERED (SupplierOrderID)
);
CREATE UNIQUE CLUSTERED INDEX UQ_Orders__OrderID ON dbo.Orders (OrderID);

The GUID is still the primary key, so my order detail table can refer to this, but I generally think that if I don't consider the key good enough to cluster on, why would I then put the same key into another table as a foreign key. I have already defined a more narrow key in OrderID, why not just use this as my foreign key in order details, and save myself 12 bytes. So I would end up with:

CREATE TABLE dbo.Orders
(
        OrderID INT IDENTITY NOT NULL,
        SupplierOrderID UNIQUEIDENTIFIER NOT NULL,

    CONSTRAINT PK_Orders__OrderID PRIMARY KEY CLUSTERED (OrderID)
);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Orders__SupplierOrderID ON dbo.Orders (SupplierOrderID);

As with everything, there are exceptions, and there are cases where I would choose the 3 columns as a composite (clustered) primary key, and this woule be if I knew there would be no child tables, and that all my select queries would still require me to select UploadedDate, SecID, and FundCode. If you had an index on Name for example:

CREATE NONCLUSTERED INDEX IX_YourTable__Name ON dbo.YourTable (Name);

SELECT  UploadDate, SecID, FundCode, Name
FROM    dbo.YourTable
WHERE   Name = 'Bob';

If you have a surrogate key, then you will seek through the name index and find Bob at row 2 only, then lookup row 2 on your clustered index to get the corresponding values for UploadedDate, SecID, and FundCode. If these three columns are your clustering key, then you remove the need for the lookup since you already have the data in the name index. The extra 209.8MB on each index could be worth it to avoid these lookup operations.

In summary (as usual), it depends - it depends on both your personal preference (I believe Aaron Bertrand and Joe Celko are still at loggerheads on the natural vs surrogate key debate, and if these two great minds can't agree, then the answer really has to be personal preference), and also your exact situation, in some situations you will want a composite primary key, in some instances you will want a surrogate key, in some instances you will want your primary key and your clustering key to be the same thing, in other instances you won't.

Upvotes: 3

Matt
Matt

Reputation: 15091

The primary key itself does not speed up queries which do not depend upon it. Other DBMSs may not rely on existence of primary key (e.g. Oracle's default is heap table).

It's just MSSQL "feature" - it "likes" primary key. It was optimized for it (for that tree-structure actually). Other engine may not benefit of simple autoincrement field.

Upvotes: -1

Gottfried Lesigang
Gottfried Lesigang

Reputation: 67311

As your UploadDate carries an implicit sorting it could be wise to have a clustered multi-column Primary Key. Any other key will be able to use this as look up key, because the clustered key does implicitly cover all columns. Thus you should create this first...

An additional int key with random numbers should NOT be clustered (it will be extremely fragmented!).

Well, many people say, that (in almost all cases) the PK should not carry any information. This really depends on you needs.

Upvotes: 1

Related Questions