Reputation: 32343
I have a structure that describes an address. It looks like this:
    class Address
    {
        public string AddressLine1 { get; set; }
        public string AddressLine2 { get; set; }
        public string City { get; set; }
        public string Zip { get; set; }
        public string Country { get; set; }
    }
I'm looking for a way to create a unique identifier for this structure (I assume it should also be of type string) which depends on all of the structure's properties (e.g. a change to AddressLine1 will also change the structure's identifier).
I know I could just concatenate all the properties together, but that gives too long an identifier. I'm looking for something significantly shorter than this.
I also assume that the number of different addresses will not exceed 100M.
Any ideas on how this identifier can be generated?
Thanks in advance.
Some background:
There are several different tables in the database which hold some information plus address data. The data is stored in a format similar to the one described above.
Unfortunately, moving the address data into a separate table is very costly right now, but I hope it will be done in the future.
I need to associate some additional properties with the address data, and I am going to create a separate table for this. That's why I need to uniquely identify the address data.
Upvotes: 1
Views: 988
Reputation: 21349
Here is the approach most people think of first:
But the real problem comes when you need to normalize the address. For instance, these all refer to the same street:
You could try to normalize the address by lower-casing the text, replacing accents, cedillas and dashes with the closest ASCII character, and parsing out the house number to keep it separate, but there will still be unforeseen exceptions. And a single different character will produce a completely different hash.
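As a rough illustration of the normalization described above, here is a minimal sketch. The `AddressNormalizer` class and its `Canonicalize` method are hypothetical names, not part of any library. It only covers lower-casing, accent and cedilla stripping, and collapsing punctuation and whitespace. It does not parse house numbers, and letters with no Unicode decomposition (such as "ø" or "ß") pass through unchanged.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AddressNormalizer
{
    // Hypothetical helper: lower-cases, strips diacritics, and collapses
    // every run of punctuation or whitespace into a single space.
    public static string Canonicalize(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return string.Empty;

        // FormD splits an accented letter into base letter + combining mark.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        bool pendingSpace = false;

        foreach (char c in decomposed)
        {
            // Drop the combining marks (accents, cedillas) left by FormD.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            if (char.IsLetterOrDigit(c))
            {
                if (pendingSpace && sb.Length > 0) sb.Append(' ');
                sb.Append(char.ToLowerInvariant(c));
                pendingSpace = false;
            }
            else
            {
                // Dash, comma, apostrophe, whitespace, and so on.
                pendingSpace = true;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, `Canonicalize("Rue de l'Église")` and `Canonicalize("  RUE DE L-EGLISE ")` both yield `rue de l eglise`, so they would hash identically. Abbreviations such as "St." versus "Street" would still differ, which is exactly the kind of exception that remains.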
Unless all your addresses are perfectly normalized, I would suggest relying on an external service such as here.com.
There are 3 ways of using the service
My favorite is option 1, because the address can still be recovered from the coordinates (which is impossible with a hash). Moreover, an address might change (a new street name, for instance) while the coordinates should not. Last but not least, you might have two completely different addresses for the same location, and they are easier to reconcile using coordinates.
Upvotes: 1
Reputation: 10427
Here is a complete example using serialization, SHA-256 hashing and Base64 encoding (based on CodesInChaos's answer):
    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Runtime.Serialization.Formatters.Binary;

    namespace Uniq
    {
        [Serializable]
        class Address
        {
            public string AddressLine1 { get; set; }
            public string AddressLine2 { get; set; }
            public string City { get; set; }
            public string Zip { get; set; }
            public string Country { get; set; }
        }

        class MainClass
        {
            public static void Main(string[] args)
            {
                // address1 and address2 have identical fields; address3 differs.
                Address address1 = new Address() { AddressLine1 = "a1" };
                Address address2 = new Address() { AddressLine1 = "a1" };
                Address address3 = new Address() { AddressLine1 = "a2" };
                string unique1 = GetUniqueIdentifier(address1);
                string unique2 = GetUniqueIdentifier(address2);
                string unique3 = GetUniqueIdentifier(address3);
                Console.WriteLine(unique1);
                Console.WriteLine(unique2);
                Console.WriteLine(unique3);
            }

            public static string GetUniqueIdentifier(object obj)
            {
                if (obj == null) return "0";
                // Note: BinaryFormatter is marked obsolete in .NET 5 and later.
                using (SHA256 sha256 = SHA256.Create())
                using (MemoryStream stream = new MemoryStream())
                {
                    // Serialize the object graph to bytes, then hash those bytes.
                    new BinaryFormatter().Serialize(stream, obj);
                    byte[] hash = sha256.ComputeHash(stream.ToArray());
                    return Convert.ToBase64String(hash);
                }
            }
        }
    }
Edit: this is a version without using BinaryFormatter. You may replace the null representation and the field separator with anything that suits your needs.
    // Requires: using System.Linq; using System.Text;
    public static string GetUniqueIdentifier(object obj)
    {
        if (obj == null) return "0";
        StringBuilder stringRep = new StringBuilder();
        // GetProperties() does not guarantee any particular order,
        // so sort by name to keep the identifier stable.
        var props = obj.GetType().GetProperties()
            .OrderBy(p => p.Name, StringComparer.Ordinal);
        foreach (var prop in props)
        {
            // '¨' stands for a null value and '^' separates the fields.
            stringRep.Append(prop.GetValue(obj, null) ?? '¨').Append('^');
        }
        using (SHA256 sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.Unicode.GetBytes(stringRep.ToString()));
            return Convert.ToBase64String(hash);
        }
    }
Upvotes: 0
Reputation: 108840
Serialize all fields to a single binary value, for example by concatenating them with proper domain separation, so that the boundaries between fields stay unambiguous.
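One common way to get that separation is to prefix every field with its length. The sketch below is only an illustration of that idea: `AddressSerializer` and `Serialize` are hypothetical names, and the choice of a 4-byte length and UTF-8 encoding is an assumption, not something the answer prescribes.

```csharp
using System;
using System.IO;
using System.Text;

static class AddressSerializer
{
    // Writes each field as a 4-byte length followed by its UTF-8 bytes.
    // A length of -1 marks null, so null and "" remain distinct.
    public static byte[] Serialize(params string[] fields)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            foreach (string field in fields)
            {
                if (field == null)
                {
                    writer.Write(-1);
                    continue;
                }
                byte[] bytes = Encoding.UTF8.GetBytes(field);
                writer.Write(bytes.Length); // length prefix
                writer.Write(bytes);        // raw field content
            }
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

With plain concatenation, the field pairs ("ab", "c") and ("a", "bc") would both become "abc". With the length prefix, `Serialize("ab", "c")` and `Serialize("a", "bc")` produce different byte arrays, so they cannot end up with the same identifier by construction.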
Then hash that value with a cryptographic hash of sufficient length. I prefer 256 bits, but 128 is probably fine. Collisions are extremely rare with good hashes; with a 256-bit hash like SHA-256 they're practically impossible.
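To put rough numbers on "extremely rare", the usual birthday-bound estimate can be applied to the 100M addresses mentioned in the question. This is a back-of-the-envelope sketch that assumes the hash output behaves like a uniformly random $b$-bit value; $n$ is the number of distinct addresses.

```latex
P(\text{collision}) \;\approx\; \frac{n(n-1)}{2 \cdot 2^{b}} \;\le\; \frac{n^{2}}{2^{\,b+1}}

n = 10^{8},\; b = 128:\quad \frac{10^{16}}{2^{129}} \approx 1.5 \times 10^{-23}

n = 10^{8},\; b = 256:\quad \frac{10^{16}}{2^{257}} \approx 4.3 \times 10^{-62}
```

Under that assumption, even a 128-bit hash leaves a negligible chance of an accidental collision at this scale.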
Upvotes: 3