devklick
devklick

Reputation: 2638

SQL Server - Generate unique ID to compare several columns altogether

In SQL Server, is it possible to generate a GUID using a specific piece of data as an input value. For example,

DECLARE @seed1 VARCHAR(10) = 'Test'
DECLARE @seed1 VARCHAR(10) = 'Testing'
SELECT NEWID(@seed1) -- will always return the same output value
SELECT NEWID(@seed2) -- will always return the same output value, and will be different to the example above

I know this completely goes against the point of GUIDs, in that the ID would not be unique. I'm looking for a way to detect duplicate records based on certain criteria (the @seed value).

I've tried generating a VARBINARY string using the HASHBYTES function, however joining between tables using VARBINARY seems extremely slow. I'm hoping to find a similar alternative that is more efficient.

Edit: for more information on why I'm looking to achieve this.

I'm looking for a fast and efficient way of detecting duplicate information that exists on two tables. For example, I have first name, last name & email. When these are concatenated, should can be used to check whether these records eexists in table A and table B.

Simply joining on these fields is possible and provides the correct result, however is quite slow. Therefore, I was hoping to find a way of transforming the data into something such as a GUID, which would make the joins much more efficient.

Upvotes: 0

Views: 2863

Answers (3)

The_ADVERSARY
The_ADVERSARY

Reputation: 1

The simplest method;

DECLARE @seed1 VARCHAR(10) = 'Test'

DECLARE @seed2 VARCHAR(10) = 'Testing'

SELECT CAST(HASHBYTES('SHA2_512',@seed1) as uniqueidentifier)

SELECT CAST(HASHBYTES('SHA2_512',@seed2) as uniqueidentifier)

Upvotes: 0

sniperd
sniperd

Reputation: 5274

You should use hashbytes and not checksum like this:

SELECT hashbytes('MD5', 'JOHN' + ',' + 'SMITH' + ',' + '[email protected]')

Although it's only a small chance checksum can produce the same number with 2 completely different values, I've had it happen with datasets of around a million. As iamdave noted (thanks!), it's a good idea to throw in some kind delimiter (a comma in my example) so that you don't compare 'JOH' + 'NSMITH' and 'JOHN' + 'SMITH' as the same.

http://www.sqlservercentral.com/blogs/microsoft-business-intelligence-and-data-warehousing/2012/02/01/checksum-vs-hashbytes/

Upvotes: 2

LIDEN
LIDEN

Reputation: 166

I think you can use CHECKSUM function for returning int type.

Upvotes: 4

Related Questions