Fredou
Fredou

Reputation: 20100

find if two dataset are identical, one in .NET and one in SQL Server

I have to find if two dataset are identical without using primary key, only data.

If it exist, I have to return the primary key otherwise I have to insert a new row and return the new primary key

Instead of comparing each field individually, since I think this could be slow????, I came up with this solution

(I'm going to take an address as an example)

On the SQL Server side, a persisted computed column with an index on it;

-- nice new function in 2012; concat
hashbytes('SHA2_256',
    concat(
        nchar(0x2460),[Address_Type_Id],
        nchar(0x2461),[Street_Name],
        nchar(0x2462),[Municipality],
        nchar(0x2463),[Postal_Zip_Code],
        nchar(0x2464),[Unit],
        nchar(0x2465),[Street_Number],
        nchar(0x2466),[PO_Box],
        nchar(0x2467),[Rural_Route_Number],
        nchar(0x2468),[Street_Type_Id],
        nchar(0x2469),[Country_Type_Id],
        nchar(0x246A),[Country_Division_Type_Id],
        nchar(0x246B),[Country_Division_Other],
        nchar(0x246C),[Direction_Type_Id]
    )
)

on the .NET side;

using(var mySHA256 = System.Security.Cryptography.SHA256Managed.Create())
{
    var Value = string.Concat(
            (char)0x2460, PartyAddress.AddressTypeId,
            (char)0x2461, PartyAddress.StreetName,
            (char)0x2462, PartyAddress.Municipality,
            (char)0x2463, PartyAddress.PostalZip,
            (char)0x2464, PartyAddress.Unit,
            (char)0x2465, PartyAddress.StreetNumber,
            (char)0x2466, PartyAddress.POBOX,
            (char)0x2467, PartyAddress.RuralRouteNumber,
            (char)0x2468, PartyAddress.StreetTypeId,
            (char)0x2469, PartyAddress.CountryTypeId,
            (char)0x246A, PartyAddress.CountryDivisionTypeId,
            (char)0x246B, PartyAddress.CountryDivisionOther,
            (char)0x246C, PartyAddress.DirectionTypeId);

    var hashValue = mySHA256.ComputeHash(Encoding.Unicode.GetBytes(Value));
}

After that I can now compare hashValue and the computed column and it does work.

My question is, before I start implementing this on MANY other table, should I look at another, better, solution?

EDIT **************

to add another solution, field by field comparison, since I'm using entity framework would be like this

var addressExist = Tombstone.Addresses
                            .FirstOrDefault(x => 
        x.Address_Type_Id == PartyAddress.AddressTypeId &&
        x.Street_Name == PartyAddress.StreetName &&
        x.Municipality == PartyAddress.Municipality &&
        x.Postal_Zip_Code == PartyAddress.PostalZip &&
        x.Unit == PartyAddress.Unit &&
        x.Street_Number == PartyAddress.StreetNumber &&
        x.PO_Box == PartyAddress.POBOX &&
        x.Rural_Route_Number == PartyAddress.RuralRouteNumber &&
        x.Street_Type_Id == PartyAddress.StreetTypeId &&
        x.Country_Type_Id == PartyAddress.CountryTypeId &&
        x.Country_Division_Type_Id == PartyAddress.CountryDivisionTypeId &&
        x.Country_Division_Other == PartyAddress.CountryDivisionOther &&
        x.Direction_Type_Id == PartyAddress.DirectionTypeId);

which generate this query

SELECT TOP (1) [Extent1].[Address_Id] AS [Address_Id]
    FROM [dbo].[Address] AS [Extent1]
    WHERE ([Extent1].[Address_Type_Id] = @p__linq__0) AND 
          (([Extent1].[Street_Name] = @p__linq__1) OR (([Extent1].[Street_Name] IS NULL) AND (@p__linq__1 IS NULL))) AND 
          (([Extent1].[Municipality] = @p__linq__2) OR (([Extent1].[Municipality] IS NULL) AND (@p__linq__2 IS NULL))) AND 
          (([Extent1].[Postal_Zip_Code] = @p__linq__3) OR (([Extent1].[Postal_Zip_Code] IS NULL) AND (@p__linq__3 IS NULL))) AND 
          (([Extent1].[Unit] = @p__linq__4) OR (([Extent1].[Unit] IS NULL) AND (@p__linq__4 IS NULL))) AND 
          (([Extent1].[Street_Number] = @p__linq__5) OR (([Extent1].[Street_Number] IS NULL) AND (@p__linq__5 IS NULL))) AND 
          (([Extent1].[PO_Box] = @p__linq__6) OR (([Extent1].[PO_Box] IS NULL) AND (@p__linq__6 IS NULL))) AND 
          (([Extent1].[Rural_Route_Number] = @p__linq__7) OR (([Extent1].[Rural_Route_Number] IS NULL) AND (@p__linq__7 IS NULL))) AND
          (([Extent1].[Street_Type_Id] = @p__linq__8) OR (([Extent1].[Street_Type_Id] IS NULL) AND (@p__linq__8 IS NULL))) AND 
          (([Extent1].[Country_Type_Id] = @p__linq__9) OR (([Extent1].[Country_Type_Id] IS NULL) AND (@p__linq__9 IS NULL))) AND 
          (([Extent1].[Country_Division_Type_Id] = @p__linq__10) OR (([Extent1].[Country_Division_Type_Id] IS NULL) AND (@p__linq__10 IS NULL))) AND 
          (([Extent1].[Country_Division_Other] = @p__linq__11) OR (([Extent1].[Country_Division_Other] IS NULL) AND (@p__linq__11 IS NULL))) AND 
          (([Extent1].[Direction_Type_Id] = @p__linq__12) OR (([Extent1].[Direction_Type_Id] IS NULL) AND (@p__linq__12 IS NULL)))

Upvotes: 3

Views: 102

Answers (1)

Mrchief
Mrchief

Reputation: 76208

If you think comparing each field is going to be slow, consider your alternative solution which comprises of:

  • take each field and append a char
  • take each filed and concatenate with another one, repeating this process n times (where is no. of fields)
  • computing SHA256 hash

So you just made 2n operation to 2 x (2n + 2n + hash(n)). Plus you're creating lots of string each of which is adding to CPU cycles and memory (strings are immutable in .Net).

You can clearly see which one is going to be slower.

Update:

  • Since the hash is calculated and persisted, then you're not incurring 2n overhead, just the processing on .Net side.
  • A StringBuilder doesn't magically cause you any savings, especially for such low number of operations.
  • Having said that, hashing is probably quicker if you index the hash value on SQL Server. Since you want a generic solution, the performance difference of hashing vs. accessing all those columns can vary. I would advise to run some benchmarks against worst case scenario to see which one performs better.
  • One word of caution about appending the char - you're now modifying data. If you can guarantee that the char you're appending will never occur in the actual data, you're good. Otherwise, you yourself are introducing a chance of collision.
  • Another idea is that you can exclude comparing NULL values, since you already know they are identical.

Upvotes: 2

Related Questions