Reputation: 31

Regex to extract inconsistent postal codes from string

Using the solution posted here, I'm looking to extract postal codes from a list of irregular data in Excel.

Below is a sample of what my data looks like:

Brampton L6P 2G9 ON Canada

M5B2R3 Toronto ON

Toronto M5J 0A6 ON Canada

M1H1T7 Canada

Toronto M4P1T8 ON Canada

MISSISUAGABRAMPTON L5M6S6 ON Canada

333 Sea Ray Inisfil l4e2y6 ON Canada

To call the function, I'm using the following formula

=RegexExtract(A1,"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$")

However the function is not working for me. I think I need to tweak my regex expression in some way but I don't know what I'm missing.

Upvotes: 3

Answers (2)

user4039065

Reputation:

google-spreadsheet

Try,

=REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d")
'alternate
=left(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)&" "&right(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)

Upvotes: 4

Comintern

Reputation: 22195

You have 2 issues.

First, the expression - if you need to extract the postal code, you can't anchor your regex with ^ and $. The first means "match must occur at the start of the string" and the second means "match must end at the end of the string". This is only useful if you are validating a postal code, but it obviously can't be used to extract one from your examples because they all contain things other than the postal code. The other problem with the regex is the negative look-ahead assertion (?!.*[DFIOQU]), which means "no match can contain the letters D, F, I, O, Q, or U". To the best of my recollection, this isn't supported in VBScript regex. If I'm mistaken, please correct me in the comments.

That gives you the slightly more pedantic expression:

[ABCEGHJKLMNPRSTVX]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d

I took the liberty of optionally allowing a - between the FSA and LDU because I see that a lot, particularly from non-Canadians.

Second, the function that you're calling (copied below from the linked answer):

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String

RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)

For i = 0 To allMatches.count - 1
    For j = 0 To allMatches.Item(i).submatches.count - 1
        result = result & (separator & allMatches.Item(i).submatches.Item(j))
    Next
Next

If Len(result) <> 0 Then
    result = Right$(result, Len(result) - Len(separator))
End If

RegexExtract = result

End Function

The first problem is that it is case sensitive. It is also tailored to extracting submatches, which you don't care about - your examples are looking for a single match.

I'd go with this much simpler option that also correctly formats the output:

Public Function ExtractCanadianPostalCode(inputText As String) As String
    With CreateObject("vbscript.regexp")
        .Pattern = "[ABCEGHJKLMNPRSTVX]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d"
        .IgnoreCase = True
        If .Test(inputText) Then
            Dim matches As Object
            Set matches = .Execute(inputText)
            ExtractCanadianPostalCode = UCase$(Left$(matches(0), 3) & " " & Right$(matches(0), 3))
        End If
    End With
End Function

Upvotes: 3

Regex to extract inconsistent postal codes from string

Answers (2)

Related Questions