mmix
mmix

Reputation: 6278

Unique, but realistic, object hash code

Ok, I am designing up a piece of software that will keep one system synced with another. The problem is that the originating system is some legacy DB2 nightmare with me only having read-only access and tables having no timestamping capability whatsoever, meaning no way to detect which rows were changed.

My idea is to just load all the rows (in total I will have about 60000 rows, synced every half hour) calculating their hashes, whilst keeping <ID, hash> tuples in my integration database. Then change detection becomes a job of comparing hashes and updating records in destination system where hashes mismatch or tuples missing altogether. Forgot to mention that reading source is cheap, updating destination is expensive, its a web service with a lot of background processing, so I would avoid updating everything every time.

Now, my problem, the c# builtin hashcode claims that its unsuitable for this purpose (equal hash does not imply equal object) and crypto hashes seem like a big overkill with 256+ bit hashes. I don't think more than 64bits is needed, that would give me 1 in 1010 chance of collision given perfectly distributed hash and allow fast hash comparison on x64 arch.

So what should I use to generate unique hashes?

Upvotes: 1

Views: 176

Answers (2)

Steve Cooper
Steve Cooper

Reputation: 21480

Another option; calculate the hash in C# using a function like this;

private readonly System.Security.Cryptography.HashAlgorithm hash = System.Security.Cryptography.SHA1.Create();

public static string CalculateSignature(IEnumerable<object> values)
{
    var sb = new StringBuilder();
    foreach (var value in values)
    {
        string valueToHash = value == null ? ">>null<<" : Convert.ToString(value, CultureInfo.InvariantCulture);
        sb.Append(valueToHash).Append(char.ConvertFromUtf32(0));
    }
    var signature = sb.ToString();
    var bytesToHash = Encoding.UTF8.GetBytes(signature);
    var hashedBytes = hash.ComputeHash(bytesToHash);
    signature = Encoding.UTF8.GetString(hashedBytes);

    return signature;
}

Edit: Hashing profiling test

To show how fast SHA1 hashing is, here's a quick test. On my dev machine, I get 60,000 hashes in 176ms. MD5 takes 161

var hash = System.Security.Cryptography.MD5.Create();

var stringtoHash = "3490518cvm90wg89puse5gu3tgu3v0afgmvkldfjgmvvvvvsh,9semc9petgucm9234ucv0[vhd,flhgvzemgu904vq2m0";

var sw = System.Diagnostics.Stopwatch.StartNew();
for(var i = 0; i < 60000; i++)
{
    var bytesToHash = Encoding.UTF8.GetBytes(stringtoHash);
    var hashedBytes = hash.ComputeHash(bytesToHash);
    var signature = Encoding.UTF8.GetString(hashedBytes);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

Upvotes: 2

Steve Cooper
Steve Cooper

Reputation: 21480

In your staging SQL tables, add a 'checksum' column, using SQL's checksum function;

Something like this;

update mysourcetable set check = checksum(id, field1, field2, field3, field4 ...)

Clarification

You mentioned having an integration database; my thought was that you would read the data from DB2 into an interim database, like SQL server, where you're already storing ID/hash pairs. If you copied all the data out of DB2, not just the IDs, then you could calculate the checksum in the integration database.

Upvotes: 0

Related Questions