Ryan
Ryan

Reputation: 23

comparing two address columns

I have addresses in U and V. I want to see if they are somewhat similar and if they are say "Update" If not say "Omit".

For example 246 N High street in U and 246 North High St in V would return a value of Update. 246 N High Street in U and 458 Auburn Drive in V would return a value of Omit.

Any ideas?

Upvotes: 1

Views: 4031

Answers (2)

JNevill
JNevill

Reputation: 50218

There are a lot of algorithms for doing fuzzy matching. One of the easier ones to implement in excel is N-Gram.

To perform an n-gram match, we have to break each address up into a list of sets of smaller character lengths. A 2-gram list of your address 246 N High street would look like 24,46,6 , N,N , H,Hi,ig,gh,h , s,st,tr,re,ee,et. We could do the same with a 3-gram: 246,46 ,6 N, N ,N H, Hi,Hig,igh,gh ,h s, st,str,tre,ree,eet

We do this with both addresses, then we can check each item in the first address's list to see if it appears in the second address's list; count the matches and divide that by the number of items in the first list. That will give you a percentage of how close they are.

You could get fancy with cell formulas mid() and countif() to do this with sheet formulas, but I think it's easier to just write it out in VBA and make it a UDF.

Function NGramCompare(string1 As String, string2 As String, intGram As Integer) As Double
    'Take in two strings and the N-gram
    Dim intChar As Integer, intGramMatch As Integer
    Dim ngramList1 As String, ngramList2 As String, nGram As Variant
    Dim nGramArr1 As Variant

    'split the first string into a list of ngrams        
    For intChar = 1 To Len(string1) - (intGram-1)
        If ngramList1 <> "" Then ngramList1 = ngramList1 & ","
        ngramList1 = ngramList1 & Mid(string1, intChar, intGram)
    Next intChar

    'split the secong string into a list of ngrams        
    For intChar = 1 To Len(string2) - (intGram-1)
        If ngramList2 <> "" Then ngramList2 = ngramList2 & ","
        ngramList2 = ngramList2 & Mid(string2, intChar, intGram)
    Next intChar

    'Split the ngramlist1 into an array through which we can iterate
    nGramArr1 = Split(ngramList1, ",")

    'Iterate through array and compare values to ngramlist2
    For Each nGram In nGramArr1
        If InStr(1, ngramList2, nGram) Then
            'we found a match, add to the counter
            intGramMatch = intGramMatch + 1
        End If
    Next nGram

    'output the percentage of grams matching.
    NGramCompare = intGramMatch / (UBound(nGramArr1) + 1)
End Function

If you've never used a UDF:

  1. Go to visual basic editor (VBE) with Alt+F11
  2. In the VBA Project window, find your workbook and right click on the name
  3. Choose: Insert>>Module
  4. Double click the new module in the list to bring up it's code window
  5. Paste this function in and save your workbook

Then, assuming address1 is in A1 and address2 is in B1 you can put, in C1:

=NGramCompare(A1, B1, 2) 

Which, for your first address, will spit out 56%. Which seems like a reasonably good match. If you find you are getting too many positive hits, you can change your 2-gram to be a 3-gram by changing that last parameter.

To take it a step further so it will say "Update" or "Omit" you could do:

=If(NGramCompare(A1, B1, 2)>.30, "Update", "Omit")

I just set that so that it will consider a match anything above 30%, but you can adjust as necessary. No matter where you set it, you will probably end up with a percentage of compares that are false positives or false negatives, but that's the way fuzzy matching goes.

Upvotes: 2

Slai
Slai

Reputation: 22876

Some of the naive approaches can be to compare the first few characters

=LEFT(A1,5)=LEFT(B1,5)

or to replace parts until they match

=(SUBSTITUTE(SUBSTITUTE(LOWER(A2)," street"," ST")," north "," N ")
 =SUBSTITUTE(SUBSTITUTE(LOWER(B2)," street"," ST")," north "," N "))

both will probably turn into a big ugly formula after adjusting for most cases

Upvotes: 0

Related Questions