SQL Server : big data replication primary key

I know that this subject has been discussed may times but there are different opinions about it. My scenario is this I have created a database which is going to be filled with 4 billion records and each year will be added between 1 to 2 million records.

We have servers in the USA and Europe and we do replication of the database to keep them similar on these servers for example same thing that Facebook does with replication.

My question is this as a primary key of the tables what should I use - BigInt or Uniqueidentifier, or it does not make any difference what I use for the replication?

Should I create a non-clustered uniqueidentifier primary key and then add another clustered bigInt column?

Should I create a clustered bigint primary key?

Upvotes: 2

Answers (2)

marc_s

Reputation: 755023

I would argue just the other option: I would try to AVOID uniqueidentifier columns - MOST DEFINITELY as your clustering key!

The clustering key is the most replicated data structure in SQL Server - and with millions and millions of rows, it does makes a huge difference if your clustering key is 8 or 16 bytes in size. Not to mention the number of page splits a uniqueidentifier clustering key would introduce - which you can totally avoid with a clustering key of BIGINT type.

If you're really interested - you must read all these articles from Kimberly Tripp - the "Queen of Indexing" in the SQL Server space - that clearly shows just how bad and counter-productive a GUID as your clustering key can be:

Upvotes: 0

ChrisLively

Reputation: 88072

Without a doubt, go with a Uniqueidentifier.

Do not add a bigint column, you don't need it.

If you use merge replication and you don't have a uniqueidentifier then the server is going to add that column anyway.

By using a GUID, you now have the capability of setting up a multi-master DB architecture. If you use a bigint as an identity field then you either force yourself to only use a single master (to control the bigint) or you then have to come up with a scheme to keep multiple servers from colliding with each other. Further by using GUIDs you get away from guessable IDs - which is generally a good thing.

My own testing in the hundred million record range with millions added / deleted daily showed no performance drop when using GUIDs vs ints for ids.

Final note - most places base64 encode the guid when calling web services or if it is going to be displayed anywhere - like in the address bar.

Upvotes: 1

SQL Server : big data replication primary key

Answers (2)

Related Questions