Reputation: 33
I receive raw data files from external sources and need to provide analysis on them. I load the files into a table & set the fields as varchars, then run a complex SQL script that does some automated analysis. One issue I've been trying to resolve is: How to tell if a column of data is duplicated with 1 or more other columns in that same table?
My goal is to have, for every column, a hash, checksum, or something similar that looks at a column's values in every row in the order they come in. I have dynamic SQL that loops through every field (different tables will have a variable number of columns) based on the fields listed in INFORMATION_SCHEMA.COLUMNS, so no concerns on how to accomplish that part.
I've been researching this all day but can't seem to find any sensible way to hash every row of a field. Google & StackOverflow searches return how to do various things to rows of data, but I couldn't find much on how to do the same thing vertically on a field.
So, I considered 2 possibilities — CHECKSUM and CHECKSUM_AGG — and hit 2 roadblocks.
The second looked promising but doesn't work as I had hoped...
declare @t1 table
(col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
select * from @t1;
select cs_1 = CHECKSUM(col_1)
, cs_2 = CHECKSUM(col_2)
, cs_3 = CHECKSUM(col_3)
from @t1;
select csa_1 = CHECKSUM_AGG(CHECKSUM([col_1]))
, csa_2 = CHECKSUM_AGG(CHECKSUM([col_2]))
, csa_3 = CHECKSUM_AGG(CHECKSUM([col_3]))
from @t1;
In the last result set, all 3 columns bring back the same value: 2147449198. In hindsight that's expected: CHECKSUM_AGG is order-insensitive, and col_3 contains exactly the same multiset of values as col_1 and col_2, just in a different row order.
Desired results: My goal is to have some code where csa_1 and csa_2 bring back the same value, while csa_3 brings back a different value, indicating that col_3 is its own distinct sequence of values.
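(Editorial aside, not in the original question.) The collision can be reproduced outside SQL Server: CHECKSUM_AGG folds the per-row checksums together with an order-insensitive operation, so any permutation of the same values produces the same aggregate. A rough Python emulation, using zlib.crc32 as a stand-in for CHECKSUM and XOR as the fold:

```python
from zlib import crc32

# The sample data from the question; None stands in for SQL NULL.
cols = {
    'col_1': ['ABC', 'ABC', 'BCD', None],
    'col_2': ['ABC', 'ABC', 'BCD', None],
    'col_3': ['ABC', 'BCD', None, 'ABC'],
}

def checksum_agg_like(values):
    """Order-insensitive XOR fold over per-value checksums.

    This is only an approximation of CHECKSUM_AGG(CHECKSUM(col)),
    but it shares the property that row order does not matter.
    """
    acc = 0
    for v in values:
        acc ^= crc32(b'' if v is None else v.encode())
    return acc

# col_3 is a permutation of col_1, so an order-insensitive
# fold cannot tell them apart -- hence the identical results.
```

Because col_3 holds the same values as col_1 in a different order, any order-insensitive aggregate will report them as equal; that is the roadblock.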
Upvotes: 3
Views: 14409
Reputation: 6783
NEW SOLUTION
EDIT: Based on some new information, namely that there may be more than 200 columns, my suggestion is to compute hashes for each column, but perform it in the ETL tool.
Essentially, feed your data buffer through a transformation that computes a cryptographic hash of the previously-computed hash concatenated with the current column value. When you reach the end of the stream, you will have serially-generated hash values for each column, that are a proxy for the content and order of each set.
Then, you can compare each to all of the others almost instantly, as opposed to running 20,000 table scans.
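The chained-hash idea above can be sketched in a few lines of Python (a minimal illustration, not tied to any particular ETL tool; the NULL marker and separator byte are arbitrary choices):

```python
import hashlib

def update(prev_digest, value):
    """Hash of the previous digest concatenated with the current value."""
    token = b'<NULL>' if value is None else value.encode()
    return hashlib.sha256(prev_digest + b'|' + token).digest()

# Rows stream through in order; one running digest per column.
rows = [
    ('ABC', 'ABC', 'ABC'),
    ('ABC', 'ABC', 'BCD'),
    ('BCD', 'BCD', None),
    (None, None, 'ABC'),
]
digests = [b''] * 3
for row in rows:
    digests = [update(d, v) for d, v in zip(digests, row)]

# digests[0] == digests[1] (same values, same order);
# digests[2] differs, because the hash is order-sensitive.
```

Unlike CHECKSUM_AGG, this running hash depends on row order, so two columns compare equal only when they hold the same values in the same sequence.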
OLD SOLUTION
Try this. You'll need a query like the one below to analyze each column against the others. There is not really a feasible hash-based solution inside T-SQL; instead, compare each set row by row using its insertion order (some sort of row sequence number). Either generate this number during ingestion, or project it during retrieval, if you have a computationally feasible means of doing so.
NOTE: I took liberties with the NULL here, comparing it as an empty string.
declare @t1 table
(
rownum int identity(1,1)
, col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
with col_1_sets as
(
select
t1.rownum as col_1_rownum
, CASE WHEN t2.rownum IS NULL THEN 1 ELSE 0 END AS col_2_miss
, CASE WHEN t3.rownum IS NULL THEN 1 ELSE 0 END AS col_3_miss
from
@t1 as t1
left join @t1 as t2 on
t1.rownum = t2.rownum
AND isnull(t1.col_1, '') = isnull(t2.col_2, '')
left join @t1 as t3 on
t1.rownum = t3.rownum
AND isnull(t1.col_1, '') = isnull(t3.col_3, '')
),
col_1_misses as
(
select
SUM(col_2_miss) as col_2_misses
, SUM(col_3_miss) as col_3_misses
from
col_1_sets
)
select
'col_1' as column_name
, CASE WHEN col_2_misses = 0 THEN 1 ELSE 0 END AS is_col_2_match
, CASE WHEN col_3_misses = 0 THEN 1 ELSE 0 END AS is_col_3_match
from
col_1_misses
Results:
+-------------+----------------+----------------+
| column_name | is_col_2_match | is_col_3_match |
+-------------+----------------+----------------+
| col_1 | 1 | 0 |
+-------------+----------------+----------------+
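The left-join/miss-counting logic above reduces to a simple row-wise comparison; a Python sketch of the same computation (using the sample data, with None for NULL and the same ISNULL-as-empty-string liberty):

```python
rows = [
    ('ABC', 'ABC', 'ABC'),
    ('ABC', 'ABC', 'BCD'),
    ('BCD', 'BCD', None),
    (None, None, 'ABC'),
]

def nz(v):
    """Mirror ISNULL(v, ''): treat NULL as an empty string."""
    return '' if v is None else v

# A "miss" is a row where the paired values disagree, as in the CTE.
col_2_misses = sum(1 for r in rows if nz(r[0]) != nz(r[1]))
col_3_misses = sum(1 for r in rows if nz(r[0]) != nz(r[2]))

# Columns match only when there are zero misses.
is_col_2_match = 1 if col_2_misses == 0 else 0
is_col_3_match = 1 if col_3_misses == 0 else 0
```

Zero misses between col_1 and col_2 gives is_col_2_match = 1; the three disagreeing rows against col_3 give is_col_3_match = 0, matching the result table.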
Upvotes: 1
Reputation: 24470
You could compare every column combo in this way, rather than using hashes:
select case when count(case when column1 = column2 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn2
, case when count(case when column1 = column3 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn3
, case when count(case when column1 = column4 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn4
, case when count(case when column1 = column5 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn5
, case when count(case when column2 = column3 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn3
, case when count(case when column2 = column4 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn4
, case when count(case when column2 = column5 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn5
, case when count(case when column3 = column4 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn4
, case when count(case when column3 = column5 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn5
, case when count(case when column4 = column5 then 1 else null end) = count(1) then 1 else 0 end Column4EqualsColumn5
from myData a
Here's the setup code:
create table myData
(
id integer not null identity(1,1)
, column1 nvarchar (32)
, column2 nvarchar (32)
, column3 nvarchar (32)
, column4 nvarchar (32)
, column5 nvarchar (32)
)
insert myData (column1, column2, column3, column4, column5)
values ('hello', 'hello', 'no', 'match', 'match')
,('world', 'world', 'world', 'world', 'world')
,('repeat', 'repeat', 'repeat', 'repeat', 'repeat')
,('me', 'me', 'me', 'me', 'me')
And here's the obligatory SQL Fiddle.
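The pairwise test in that query boils down to: two columns match when every row agrees. A compact Python sketch of the same check over the myData sample (here, as in the raw `=` comparison, NULL = NULL does not count as a match):

```python
from itertools import combinations

# The myData sample, column-wise.
cols = {
    'column1': ['hello', 'world', 'repeat', 'me'],
    'column2': ['hello', 'world', 'repeat', 'me'],
    'column3': ['no', 'world', 'repeat', 'me'],
    'column4': ['match', 'world', 'repeat', 'me'],
    'column5': ['match', 'world', 'repeat', 'me'],
}

def columns_equal(a, b):
    """Mirror COUNT(CASE WHEN a = b ...) = COUNT(1): every row must match,
    and a NULL on either side is a mismatch."""
    return all(x is not None and y is not None and x == y
               for x, y in zip(a, b))

matches = {(m, n): columns_equal(cols[m], cols[n])
           for m, n in combinations(cols, 2)}
```

This reproduces the query's output shape: one boolean per column pair, true only when the columns are row-for-row identical.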
Also, to save you having to write this out by hand, here's some code to generate the above. This version also includes logic to handle the scenario where both columns' values are null:
declare @tableName sysname = 'myData'
, @sql nvarchar(max)
;with cte as (
select name, row_number() over (order by column_id) r
from sys.columns
where object_id = object_id(@tableName, 'U') --filter on our table
and name not in ('id') --only process for the columns we're interested in
)
select @sql = coalesce(@sql + char(10) + ', ', 'select') + ' case when count(case when ' + quotename(a.name) + ' = ' + quotename(b.name) + ' or (' + quotename(a.name) + ' is null and ' + quotename(b.name) + ' is null) then 1 else null end) = count(1) then 1 else 0 end ' + quotename(a.name + '_' + b.name)
from cte a
inner join cte b
on b.r > a.r
order by a.r, b.r
set @sql = @sql + char(10) + 'from ' + quotename(@tableName)
print @sql
NB: That's not to say you should run it as dynamic SQL; rather you can use this to generate your code (unless you need to support the scenario where the number or name of columns may vary at runtime, in which case you'd obviously want the dynamic option).
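The same code generation can be sketched outside the database. Given a list of column names (here hard-coded rather than read from sys.columns), emit one comparison expression per pair, with the NULL = NULL handling included:

```python
from itertools import combinations

def build_compare_sql(table, columns):
    """Generate the pairwise column-comparison query as a string.

    Treats NULL = NULL as a match, mirroring the dynamic SQL above.
    """
    exprs = [
        "case when count(case when [{a}] = [{b}] "
        "or ([{a}] is null and [{b}] is null) then 1 else null end) = count(1) "
        "then 1 else 0 end [{a}_{b}]".format(a=a, b=b)
        for a, b in combinations(columns, 2)
    ]
    return "select " + "\n, ".join(exprs) + "\nfrom [" + table + "]"

sql = build_compare_sql('myData', ['column1', 'column2', 'column3'])
```

As with the T-SQL generator, the output is meant to be reviewed and pasted into your script, not necessarily executed blindly as dynamic SQL.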
Upvotes: 1