How to quantify differences between data sets in SQL Server?

Question

I'm comparing two datasets in SQL Server (tables with the same schema) using row hashing (for example, with CheckSum() or HashBytes()). At this point, I can tell which records are identical and which are different. For the records that differ, I am looking for a way to quantify the differences. For example, consider the simplified tables below:

table1: row11: 0, 0, 0 --> hash1 = 0x0000

table2: row21: 0, 0, 1 --> hash2 = 0x0001

table3: row31: 1, 1, 1 --> hash3 = x

The inequality of row11, row21, and row31 is apparent from the fact that hash1 <> hash2 <> hash3.

The question is: how do I associate a magnitude of this difference with the values of the hashes? In other words, how can I tell, just from the hash values, that the pair (row11, row21) is "more similar" than the pair (row11, row31)?
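To make the intended notion of "more similar" concrete, here is a minimal Python sketch (not SQL Server code). It assumes that similarity means the number of column positions whose values differ, which is only one possible definition. The function names `row_hash` and `count_differing_columns` are hypothetical, and SHA-256 from `hashlib` is just a stand-in for a row hash, not what CheckSum() or HashBytes() actually compute over a row.

```python
import hashlib

# The three example rows from the question.
row11 = (0, 0, 0)
row21 = (0, 0, 1)
row31 = (1, 1, 1)

def row_hash(row):
    """Hash a row's values (stand-in for a SQL Server row hash, not its algorithm)."""
    payload = "|".join(str(v) for v in row).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def count_differing_columns(a, b):
    """Hypothetical distance: number of column positions whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Comparing hashes only answers "same or different" for each pair.
print(row_hash(row11) != row_hash(row21))
print(row_hash(row11) != row_hash(row31))

# Comparing the columns directly yields the kind of magnitude being asked about.
print(count_differing_columns(row11, row21))  # 1 differing column
print(count_differing_columns(row11, row31))  # 3 differing columns
```

Under this assumed definition, (row11, row21) has distance 1 and (row11, row31) has distance 3. The open question is whether a comparable ordering can be recovered from the hash values alone.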


Answers (0)
