aggregate or other special computational functions against MD5 varbinary values

Question

I have a very long natural key (hundreds of characters) in a chemical application. I can use MD5 against the value to create an ersatz natural key. What I'd like to know is whether the following is possible; consider two sets of child rows:

       ParentTable
       parent1
       parent2

       ChildTable
       parent1....md5 value a
       parent1....md5 value b
       parent1....md5 value c

       parent2....md5 value b
       parent2....md5 value c
       parent2....md5 value a

Is it possible to "compute" each three-row set of md5 values and compare the computed values against each other in order to determine whether parent1 and parent2 have the same set of children? The children are not in the same "order" but there is really no "order" in any case -- those three tuples are really just a "bag" of rows.

So the question could be, do parent1 and parent2 have the same things in their respective bags?

And again, to be clear, I am hoping to avoid having to compare one set against the other, row by row. Ideally I'd like to find something analogous to select sum(md5) from ChildTable where parent=1.

EDIT: there would be a unique index on the child table to prevent the same value from occurring more than once for the same parent: create unique index UX_CHILDTABLE on ChildTable(parent, md5);

EDIT2: "sameness" or "equality" when bag is compared to bag would be defined as : they contain the same set of MD5 values, i.e. every item in one bag has its twin in the other bag. All of bag1's entities are found in bag2 and all of bag2's entities are found in bag1.

PrimeWaffle · Accepted Answer

I have a potential solution to your answer, but it will depend on some clarification. Since I'm a new user I don't have the ability to ask in a comment. You state that you have "bags" of rows corresponding to each parent. From my understanding, bags can have duplicate items. If duplicate MD5s exist in your parents, do you want to only consider the distinct ones? Secondly, are you looking for equality only if bag A = bag B or also when bag A is a subset of bag B and vise versa?

I have presented a solution which looks for only distinct MD5s considering an equality relation:

Say we have two parent tables which are aggregated in a child table as your example shows:

--Create tables that hold our data
if object_id('tempdb..#parent1') is not null drop table #parent1
create table #parent1 (parent varchar(20), words varchar(20), chem varchar(100), bHash varchar(34))
insert into #parent1 values ('parent 1', 'this', 'a', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'a'))))
insert into #parent1 values ('parent 1', 'does', 'b', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'b'))))
insert into #parent1 values ('parent 1', 'not', 'c', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'c'))))

if object_id('tempdb..#parent2') is not null drop table #parent2
create table #parent2 (parent varchar(20), words varchar(20), chem varchar(100), bHash varchar(34))
insert into #parent2 values ('parent 2', 'matter', 'b', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'b'))))
insert into #parent2 values ('parent 2', 'just', 'c', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'c'))))
insert into #parent2 values ('parent 2', 'words', 'a', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'a'))))
insert into #parent2 values ('parent 2', 'duplicate', 'a', upper(sys.fn_sqlvarbasetostr(hashbytes('md5', 'a'))))

if object_id('tempdb..#child') is not null drop table #child
select *
into #child
from #parent1
union all
select * from #parent2

We can concatenate the row values of the MD5s into a single ordered list by treating the MD5 values as a string. After concatenation, we can generate a new MD5 hash which applies to the parent's unique collection of values.

if object_id('tempdb..#parentHash') is not null drop table #parentHash
select distinct parent
      --We can create a new hash off an ordered list of distinct values (this is the list)
      ,stuff((select ','+o.bHash
             from (select distinct parent, bHash from #child) o
             where o.parent = t.parent
             order by o.bHash
             for xml path('')),1,1,'') ordered_list
      --Create the hash
      ,upper(sys.fn_sqlvarbasetostr(hashbytes('md5',
           stuff((select ','+o.bHash
                 from (select distinct parent, bHash from #child) o
                 where o.parent = t.parent
                 order by o.bHash
                 for xml path('')),1,1,'')
       ))) parentHash
into #parentHash
from (select distinct parent, bHash from #child) t

You could then query off of these values where parent hashes are equivalent.

select distinct o.parent
      ,t.parent
from #parentHash o
inner join #parentHash t on o.parentHash = t.parentHash and o.parent <> t.parent

Again, this solution assumes that you are only looking for bags which have equivalent unique values. If you are concerned about duplicates or parents which are subsets of one another a different solution would have to be explored.

aggregate or other special computational functions against MD5 varbinary values

Answers (1)

Related Questions