Oleks

Reputation: 32343

How to generate a unique identifier for an address structure?

I have a structure that describes an address. It looks like this:

class Address
{
    public string AddressLine1 { get; set; }
    public string AddressLine2 { get; set; }
    public string City { get; set; }
    public string Zip { get; set; }
    public string Country { get; set; }
} 

I'm looking for a way to create a unique identifier for this structure (I assume it should also be a string) that depends on all of the structure's properties (e.g. a change to AddressLine1 should also change the identifier).

I know I could just concatenate all the properties, but that gives too long an identifier. I'm looking for something significantly shorter.

I also assume that the number of distinct addresses will not exceed 100M.

Any ideas on how this identifier can be generated?

Thanks in advance.

Some background:

There are several different tables in the database which hold some information + address data. The data is stored in the format similar to the one described above.

Unfortunately, moving the address data into a separate table is very costly right now, but I hope it will be done in the future.

I need to associate some additional properties with the address data, and I am going to create a separate table for this. That's why I need to uniquely identify the address data.

Upvotes: 1

Views: 988

Answers (3)

Flavien Volken

Reputation: 21349

Here is the approach most people think of first:

  1. Normalize the address
  2. Create a hash from the normalized address
  3. Done…

But the real problem comes when you need to normalize the address. For instance, these street addresses are all the same:

  • "place Saint-François 14"
  • "place saint françois 14"
  • "place st. françois 14"
  • "place st. francois 14"
  • "14 Place saint François"

You could try to normalize the address by lower-casing the text, replacing accents, cedillas and dashes with the closest ASCII characters, and parsing out the house number to keep it separate, but there will still be unforeseen exceptions. And a single differing character will produce a completely different hash.
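
As a rough illustration of the normalization just described, here is a minimal sketch. `AddressNormalizer` is a hypothetical helper, not part of any library, and sorting the tokens is only a crude stand-in for properly parsing the house number (it would also conflate addresses that differ only in word order, such as a swapped house and apartment number).

```csharp
using System;
using System.Globalization;
using System.Text;

static class AddressNormalizer
{
    // Hypothetical helper: lower-cases, strips diacritics, turns punctuation
    // into spaces, and sorts the tokens so that word order does not matter.
    public static string Normalize(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return string.Empty;

        // Decompose accented characters into a base character plus a
        // combining mark, e.g. "ç" becomes "c" followed by a cedilla.
        string decomposed = raw.ToLowerInvariant().Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the combining mark
            sb.Append(char.IsLetterOrDigit(c) ? c : ' '); // dashes, dots, commas become spaces
        }

        string[] tokens = sb.ToString()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Array.Sort(tokens, StringComparer.Ordinal); // assumption: token order is irrelevant
        return string.Join(" ", tokens);
    }
}

static class NormalizerDemo
{
    static void Main()
    {
        Console.WriteLine(AddressNormalizer.Normalize("place Saint-François 14"));
        Console.WriteLine(AddressNormalizer.Normalize("14 Place saint François"));
        Console.WriteLine(AddressNormalizer.Normalize("place st. françois 14"));
    }
}
```

The first two inputs normalize to the same string, but the third does not, because nothing here knows that "st." abbreviates "saint". That is exactly the kind of exception that makes hashing a hand-normalized address fragile.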

Unless all your addresses are perfectly normalized, I would suggest relying on an external service like here.com.

There are 3 ways of using the service:

  1. Use the service to find the coordinates of your address (longitude, latitude, altitude), then use those as your ID (or as 3 IDs).
  2. Use the service to find the address and store the service's own ID in your DB. The drawback is that they do not guarantee that their ID will not change.
  3. Use the service to look up your address in their registry, then create the hash from their entry (which is normalized according to their standard).

My favorite is option 1, since we can still recover the address from the coordinates (which is impossible with a hash). Moreover, an address might change (a new street name, for instance) while the coordinates should not. Last but not least, you might have 2 completely different addresses for the same location, and these are easier to reconcile using coordinates.
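
If coordinates are used as the identifier, they probably need to be reduced to a fixed precision first, since raw floating-point values rarely compare equal. The sketch below is only an assumption about how that might look: `CoordinateKey` is a hypothetical helper, the six-decimal precision (roughly 0.1 m) is my choice, and the coordinates in the demo are made up rather than returned by any geocoding service.

```csharp
using System;
using System.Globalization;

static class CoordinateKey
{
    // Hypothetical helper: formats a latitude/longitude pair as a
    // fixed-precision string. Six decimals is an assumed precision.
    public static string FromLatLong(double latitude, double longitude)
    {
        // InvariantCulture keeps "." as the decimal separator on every machine.
        return string.Format(CultureInfo.InvariantCulture,
                             "{0:F6},{1:F6}", latitude, longitude);
    }
}

static class CoordinateKeyDemo
{
    static void Main()
    {
        // Made-up coordinates for illustration only.
        Console.WriteLine(CoordinateKey.FromLatLong(46.5197, 6.6323));
    }
}
```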

Upvotes: 1

Ahmed KRAIEM

Reputation: 10427

Here is a complete example using serialization, SHA-256 hashing and Base64 encoding (based on CodesInChaos's answer):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Runtime.Serialization.Formatters.Binary;

namespace Uniq
{
    [Serializable]
    class Address
    {
        public string AddressLine1 { get; set; }
        public string AddressLine2 { get; set; }
        public string City { get; set; }
        public string Zip { get; set; }
        public string Country { get; set; }
    } 
    class MainClass
    {
        public static void Main (string[] args)
        {
            Address address1 = new Address(){AddressLine1 = "a1"};
            Address address2 = new Address(){AddressLine1 = "a1"};
            Address address3 = new Address(){AddressLine1 = "a2"};
            string unique1 = GetUniqueIdentifier(address1);
            string unique2 = GetUniqueIdentifier(address2);
            string unique3 = GetUniqueIdentifier(address3);
            Console.WriteLine(unique1);
            Console.WriteLine(unique2);
            Console.WriteLine(unique3);
        }
        public static string GetUniqueIdentifier(object obj){
            if (obj == null) return "0";
            // Note: BinaryFormatter is marked obsolete in .NET 5 and later.
            BinaryFormatter formatter = new BinaryFormatter ();
            using (SHA256 mySHA256 = SHA256.Create ())
            using (MemoryStream stream = new MemoryStream())
            {
                // Serialize the whole object to bytes, then hash those bytes.
                formatter.Serialize(stream, obj);
                byte[] hash = mySHA256.ComputeHash(stream.ToArray());
                // Base64 of a 32-byte hash is a 44-character string.
                return Convert.ToBase64String(hash);
            }
        }
    }
}

Edit: here is a version that does not use BinaryFormatter. You may replace the null representation and the field separator with anything that suits your needs.

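// Requires two additional usings: System.Linq and System.Text.
public static string GetUniqueIdentifier(object obj){
    if (obj == null) return "0";
    StringBuilder stringRep = new StringBuilder();
    // GetProperties() does not guarantee an order, so sort by name to keep the hash stable.
    foreach (var p in obj.GetType().GetProperties().OrderBy(prop => prop.Name, StringComparer.Ordinal))
    {
        // '¨' stands in for null and '^' separates fields. Values that contain
        // these characters can make different objects produce the same string.
        stringRep.Append(p.GetValue(obj, null) ?? '¨').Append('^');
    }
    using (SHA256 mySHA256 = SHA256.Create ())
    {
        byte[] hash = mySHA256.ComputeHash(Encoding.Unicode.GetBytes(stringRep.ToString()));
        return Convert.ToBase64String(hash);
    }
}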

Upvotes: 0

CodesInChaos

Reputation: 108840

Serialize all fields into one large binary value, for example by concatenating them with proper domain separation so that field boundaries stay unambiguous.
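
One possible way to get that separation, sketched here under my own assumptions, is to length-prefix every field instead of joining the fields with a separator character. `AddressSerializer` is a hypothetical name; `BinaryWriter.Write(string)` is the standard .NET call that writes a length prefix before the string bytes.

```csharp
using System;
using System.IO;
using System.Text;

static class AddressSerializer
{
    // Writes each field as a presence flag followed by a length-prefixed
    // UTF-8 string, so field boundaries and null values are unambiguous.
    public static byte[] Serialize(params string[] fields)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            foreach (string field in fields)
            {
                writer.Write(field != null);            // distinguishes null from ""
                if (field != null) writer.Write(field); // prefixed with its byte length
            }
            writer.Flush();
            return stream.ToArray();
        }
    }
}

static class SerializerDemo
{
    static void Main()
    {
        // Plain concatenation would turn both of these into "abc".
        byte[] a = AddressSerializer.Serialize("ab", "c");
        byte[] b = AddressSerializer.Serialize("a", "bc");
        Console.WriteLine(BitConverter.ToString(a));
        Console.WriteLine(BitConverter.ToString(b));
    }
}
```

The two calls produce different byte sequences even though their concatenated text is identical, so the resulting bytes can be passed to a hash function without the risk of two different addresses serializing to the same input.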

Then hash that value with a cryptographic hash of sufficient length. I prefer 256 bits, but 128 is probably fine. Collisions are extremely rare with good hashes; with a 256-bit hash like SHA-256 they are practically impossible.
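
To put rough numbers on this, the usual birthday approximation can be applied to the 100M addresses mentioned in the question. This is a back-of-the-envelope estimate that assumes the hash output behaves like a uniformly random b-bit value:

```latex
p_{\text{collision}} \approx \frac{n^2}{2^{\,b+1}}, \qquad n = 10^{8}

b = 128:\quad p \approx \frac{10^{16}}{2^{129}} \approx 1.5 \times 10^{-23}

b = 256:\quad p \approx \frac{10^{16}}{2^{257}} \approx 4.3 \times 10^{-62}
```

Under that assumption, even a 128-bit hash leaves a negligible chance of any collision among 100M addresses.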

Upvotes: 3
