Reputation: 33
I receive raw data files from external sources and need to provide analysis on them. I load the files into a table & set the fields as varchars, then run a complex SQL script that does some automated analysis. One issue I've been trying to resolve is: How to tell if a column of data is duplicated with 1 or more other columns in that same table?
My goal is to have, for every column, a hash, checksum, or something similar that looks at a column's values in every row in the order they come in. I have dynamic SQL that loops through every field (different tables will have a variable number of columns) based on the fields listed in INFORMATION_SCHEMA.COLUMNS, so no concerns on how to accomplish that part.
I've been researching this all day but can't seem to find any sensible way to hash every row of a field. Google & StackOverflow searches return how to do various things to rows of data, but I couldn't find much on how to do the same thing vertically on a field.
So, I considered 2 possibilities — CHECKSUM and CHECKSUM_AGG — and hit 2 roadblocks.
The second looked promising but doesn't work as I had hoped...
declare @t1 table
(col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
select * from @t1;
select cs_1 = CHECKSUM(col_1)
, cs_2 = CHECKSUM(col_2)
, cs_3 = CHECKSUM(col_3)
from @t1;
select csa_1 = CHECKSUM_AGG(CHECKSUM([col_1]))
, csa_2 = CHECKSUM_AGG(CHECKSUM([col_2]))
, csa_3 = CHECKSUM_AGG(CHECKSUM([col_3]))
from @t1;
In the last result set, all 3 columns bring back the same value: 2147449198. In hindsight that's expected: CHECKSUM_AGG is order-insensitive, and col_3 contains exactly the same multiset of values as col_1 and col_2, just in a different row order.
Desired results: My goal is to have some code where csa_1 and csa_2 bring back the same value, while csa_3 brings back a different value, indicating that col_3 is its own distinct sequence of values.
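(Editorial aside, not in the original question.) The collision can be reproduced outside SQL Server: CHECKSUM_AGG folds the per-row checksums together with an order-insensitive operation, so any permutation of the same values produces the same aggregate. A rough Python emulation, using zlib.crc32 as a stand-in for CHECKSUM and XOR as the fold:

```python
from zlib import crc32

# The sample data from the question; None stands in for SQL NULL.
cols = {
    'col_1': ['ABC', 'ABC', 'BCD', None],
    'col_2': ['ABC', 'ABC', 'BCD', None],
    'col_3': ['ABC', 'BCD', None, 'ABC'],
}

def checksum_agg_like(values):
    """Order-insensitive XOR fold over per-value checksums.

    This is only an approximation of CHECKSUM_AGG(CHECKSUM(col)),
    but it shares the property that row order does not matter.
    """
    acc = 0
    for v in values:
        acc ^= crc32(b'' if v is None else v.encode())
    return acc

# col_3 is a permutation of col_1, so an order-insensitive
# fold cannot tell them apart -- hence the identical results.
```

Because col_3 holds the same values as col_1 in a different order, any order-insensitive aggregate will report them as equal; that is the roadblock.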
Upvotes: 3
Views: 14409
Reputation: 6783
NEW SOLUTION
EDIT: Based on some new information, namely that there may be more than 200 columns, my suggestion is to compute hashes for each column, but perform it in the ETL tool.
Essentially, feed your data buffer through a transformation that computes a cryptographic hash of the previously-computed hash concatenated with the current column value. When you reach the end of the stream, you will have serially-generated hash values for each column, that are a proxy for the content and order of each set.
Then, you can compare each to all of the others almost instantly, as opposed to running 20,000 table scans.
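The chained-hash idea above can be sketched in a few lines of Python (a minimal illustration, not tied to any particular ETL tool; the NULL marker and separator byte are arbitrary choices):

```python
import hashlib

def update(prev_digest, value):
    """Hash of the previous digest concatenated with the current value."""
    token = b'<NULL>' if value is None else value.encode()
    return hashlib.sha256(prev_digest + b'|' + token).digest()

# Rows stream through in order; one running digest per column.
rows = [
    ('ABC', 'ABC', 'ABC'),
    ('ABC', 'ABC', 'BCD'),
    ('BCD', 'BCD', None),
    (None, None, 'ABC'),
]
digests = [b''] * 3
for row in rows:
    digests = [update(d, v) for d, v in zip(digests, row)]

# digests[0] == digests[1] (same values, same order);
# digests[2] differs, because the hash is order-sensitive.
```

Unlike CHECKSUM_AGG, this running hash depends on row order, so two columns compare equal only when they hold the same values in the same sequence.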
OLD SOLUTION
Try this. You'll need a query like the one below to analyze each column against the others. There is not really a feasible hash-based solution inside T-SQL; instead, compare each set row by row using its insertion order (some sort of row sequence number). Either generate this number during ingestion, or project it during retrieval, if you have a computationally feasible means of doing so.
NOTE: I took liberties with the NULL here, comparing it as an empty string.
declare @t1 table
(
rownum int identity(1,1)
, col_1 varchar(5)
, col_2 varchar(5)
, col_3 varchar(5));
insert into @t1
values ('ABC', 'ABC', 'ABC')
, ('ABC', 'ABC', 'BCD')
, ('BCD', 'BCD', NULL)
, (NULL, NULL, 'ABC');
with col_1_sets as
(
select
t1.rownum as col_1_rownum
, CASE WHEN t2.rownum IS NULL THEN 1 ELSE 0 END AS col_2_miss
, CASE WHEN t3.rownum IS NULL THEN 1 ELSE 0 END AS col_3_miss
from
@t1 as t1
left join @t1 as t2 on
t1.rownum = t2.rownum
AND isnull(t1.col_1, '') = isnull(t2.col_2, '')
left join @t1 as t3 on
t1.rownum = t3.rownum
AND isnull(t1.col_1, '') = isnull(t3.col_3, '')
),
col_1_misses as
(
select
SUM(col_2_miss) as col_2_misses
, SUM(col_3_miss) as col_3_misses
from
col_1_sets
)
select
'col_1' as column_name
, CASE WHEN col_2_misses = 0 THEN 1 ELSE 0 END AS is_col_2_match
, CASE WHEN col_3_misses = 0 THEN 1 ELSE 0 END AS is_col_3_match
from
col_1_misses
Results:
+-------------+----------------+----------------+
| column_name | is_col_2_match | is_col_3_match |
+-------------+----------------+----------------+
| col_1 | 1 | 0 |
+-------------+----------------+----------------+
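The left-join/miss-counting logic above reduces to a simple row-wise comparison; a Python sketch of the same computation (using the sample data, with None for NULL and the same ISNULL-as-empty-string liberty):

```python
rows = [
    ('ABC', 'ABC', 'ABC'),
    ('ABC', 'ABC', 'BCD'),
    ('BCD', 'BCD', None),
    (None, None, 'ABC'),
]

def nz(v):
    """Mirror ISNULL(v, ''): treat NULL as an empty string."""
    return '' if v is None else v

# A "miss" is a row where the paired values disagree, as in the CTE.
col_2_misses = sum(1 for r in rows if nz(r[0]) != nz(r[1]))
col_3_misses = sum(1 for r in rows if nz(r[0]) != nz(r[2]))

# Columns match only when there are zero misses.
is_col_2_match = 1 if col_2_misses == 0 else 0
is_col_3_match = 1 if col_3_misses == 0 else 0
```

Zero misses between col_1 and col_2 gives is_col_2_match = 1; the three disagreeing rows against col_3 give is_col_3_match = 0, matching the result table.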
Upvotes: 1
Reputation: 24470
You could compare every column combo in this way, rather than using hashes:
select case when count(case when column1 = column2 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn2
, case when count(case when column1 = column3 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn3
, case when count(case when column1 = column4 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn4
, case when count(case when column1 = column5 then 1 else null end) = count(1) then 1 else 0 end Column1EqualsColumn5
, case when count(case when column2 = column3 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn3
, case when count(case when column2 = column4 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn4
, case when count(case when column2 = column5 then 1 else null end) = count(1) then 1 else 0 end Column2EqualsColumn5
, case when count(case when column3 = column4 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn4
, case when count(case when column3 = column5 then 1 else null end) = count(1) then 1 else 0 end Column3EqualsColumn5
, case when count(case when column4 = column5 then 1 else null end) = count(1) then 1 else 0 end Column4EqualsColumn5
from myData a
Here's the setup code:
create table myData
(
id integer not null identity(1,1)
, column1 nvarchar (32)
, column2 nvarchar (32)
, column3 nvarchar (32)
, column4 nvarchar (32)
, column5 nvarchar (32)
)
insert myData (column1, column2, column3, column4, column5)
values ('hello', 'hello', 'no', 'match', 'match')
,('world', 'world', 'world', 'world', 'world')
,('repeat', 'repeat', 'repeat', 'repeat', 'repeat')
,('me', 'me', 'me', 'me', 'me')
And here's the obligatory SQL Fiddle.
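The pairwise test in that query boils down to: two columns match when every row agrees. A compact Python sketch of the same check over the myData sample (here, as in the raw `=` comparison, NULL = NULL does not count as a match):

```python
from itertools import combinations

# The myData sample, column-wise.
cols = {
    'column1': ['hello', 'world', 'repeat', 'me'],
    'column2': ['hello', 'world', 'repeat', 'me'],
    'column3': ['no', 'world', 'repeat', 'me'],
    'column4': ['match', 'world', 'repeat', 'me'],
    'column5': ['match', 'world', 'repeat', 'me'],
}

def columns_equal(a, b):
    """Mirror COUNT(CASE WHEN a = b ...) = COUNT(1): every row must match,
    and a NULL on either side is a mismatch."""
    return all(x is not None and y is not None and x == y
               for x, y in zip(a, b))

matches = {(m, n): columns_equal(cols[m], cols[n])
           for m, n in combinations(cols, 2)}
```

This reproduces the query's output shape: one boolean per column pair, true only when the columns are row-for-row identical.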
Also, to save you having to write this out by hand, here's some code to generate the above. This version also includes logic to handle the scenario where both columns' values are null:
declare @tableName sysname = 'myData'
, @sql nvarchar(max)
;with cte as (
select name, row_number() over (order by column_id) r
from sys.columns
where object_id = object_id(@tableName, 'U') --filter on our table
and name not in ('id') --only process for the columns we're interested in
)
select @sql = coalesce(@sql + char(10) + ', ', 'select') + ' case when count(case when ' + quotename(a.name) + ' = ' + quotename(b.name) + ' or (' + quotename(a.name) + ' is null and ' + quotename(b.name) + ' is null) then 1 else null end) = count(1) then 1 else 0 end ' + quotename(a.name + '_' + b.name)
from cte a
inner join cte b
on b.r > a.r
order by a.r, b.r
set @sql = @sql + char(10) + 'from ' + quotename(@tableName)
print @sql
NB: That's not to say you should run it as dynamic SQL; rather you can use this to generate your code (unless you need to support the scenario where the number or name of columns may vary at runtime, in which case you'd obviously want the dynamic option).
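The same code generation can be sketched outside the database. Given a list of column names (here hard-coded rather than read from sys.columns), emit one comparison expression per pair, with the NULL = NULL handling included:

```python
from itertools import combinations

def build_compare_sql(table, columns):
    """Generate the pairwise column-comparison query as a string.

    Treats NULL = NULL as a match, mirroring the dynamic SQL above.
    """
    exprs = [
        "case when count(case when [{a}] = [{b}] "
        "or ([{a}] is null and [{b}] is null) then 1 else null end) = count(1) "
        "then 1 else 0 end [{a}_{b}]".format(a=a, b=b)
        for a, b in combinations(columns, 2)
    ]
    return "select " + "\n, ".join(exprs) + "\nfrom [" + table + "]"

sql = build_compare_sql('myData', ['column1', 'column2', 'column3'])
```

As with the T-SQL generator, the output is meant to be reviewed and pasted into your script, not necessarily executed blindly as dynamic SQL.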
Upvotes: 1