Matt Ruwe

Reputation: 3406

Regular expression for parsing mailing addresses

I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.

Note: The STREETTYPES and QUADRANTS constants contain all of the relevant street types and quadrants, respectively.

I've included a subset here:

private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";

private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";

HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.

    private void Parse(string line1)
    {
        HouseNumber = string.Empty;
        Quadrant = string.Empty;
        StreetName = string.Empty;
        StreetType = string.Empty;

        if (!String.IsNullOrEmpty(line1))
        {
            // Strip periods so abbreviations such as "N.E." or "Ave." match the plain tokens.
            string noPeriodsLine1 = line1.Replace(".", "");

            string addressParseRegEx =
                @"(?ix)
            ^
            \s*
            (?:
               (?<housenumber>\d+)
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS +
                @"))?
               (?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS + @"))?
               (?:(?:\s+|-)(?<streettype>" + STREETTYPES +
                @"))?
               (?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
                QUADRANTS +
                @"))(?:\d+|\S+)))?
               (?:(?:\s+|-)(?<streettypequadrant>(" +
                QUADRANTS + @")))??
               (?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
            |
               (?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
            )
            \s*
            $
            ";
            Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
            if (match.Success)
            {
                HouseNumber = match.Groups["housenumber"].Value;
                Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
                if (match.Groups["streetname"].Captures.Count > 1)
                {
                    foreach (Capture capture in match.Groups["streetname"].Captures)
                    {
                        StreetName += capture.Value + " ";
                    }
                    StreetName = StreetName.Trim();
                }
                else
                {
                    StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
                }
                StreetType = match.Groups["streettype"].Value;

                // If the matched street type is known, use its abbreviated form
                // (needed especially for credit bureau calls).
                // StreetTypes is a lookup that is not shown in this snippet.
                string streetTypeAbbreviation;
                if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
                {
                    StreetType = streetTypeAbbreviation;
                }
            }
        }

    }

Upvotes: 10

Views: 10733

Answers (7)

MoXplod

Reputation: 3852

If someone runs into this problem in 2013/2014 :) you can use the Google Geocoding API. It provides more functionality than a regex alone: you can even get the lat/long for an address. And it's free.

For an example address:

http://maps.googleapis.com/maps/api/geocode/xml?address=2520%20Cohasset%20Rd%20-%20Chico%2C%20CA%2095973-1307%20530-893-1300%20%20&sensor=false
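
A minimal sketch of building that request URL from C#, assuming the endpoint and the `address` and `sensor` parameters exactly as they appear in the URL above. The `GeocodeUrl` class and its `Build` method are hypothetical names. Only the URL construction is shown, not the HTTP call or the parsing of the XML response.

```csharp
using System;

public static class GeocodeUrl
{
    // Endpoint copied from the example URL above; treat it as an assumption.
    private const string Endpoint = "http://maps.googleapis.com/maps/api/geocode/xml";

    // Percent-encodes the free-form address and appends it as the query string.
    public static string Build(string address)
    {
        return Endpoint + "?address=" + Uri.EscapeDataString(address) + "&sensor=false";
    }
}
```

Encoding the address with `Uri.EscapeDataString` means spaces and commas in the raw line do not have to be escaped by hand.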


Upvotes: 1

Curtis Maurand

Reputation: 21

This actually works pretty well, except that it doesn't pull apartment numbers. We're working on that. It also coughed a little when we had an address of 769 Branch Ave. Of course, "branch" is one of the street types that it's looking for. It all goes back to that making-order-out-of-chaos thing. We know that it's going to break here and there.
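
One possible way around the "769 Branch Ave" case, sketched under assumptions: treat only the last token as a street type candidate, and only when at least one token is left over for the name. This is a different technique from the regex in the question, not a patch to it. `StreetTypeSplitter`, `Split`, and the tiny `KnownTypes` set are hypothetical, and the input is assumed to be the part of the line after the house number.

```csharp
using System;
using System.Collections.Generic;

public static class StreetTypeSplitter
{
    // Illustrative subset only; a real list would mirror the full STREETTYPES constant.
    private static readonly HashSet<string> KnownTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "AVE", "AVENUE", "BR", "BRANCH", "ST", "STREET" };

    // Splits the text after the house number into a street name and a street type.
    // Only the LAST token may be the type, and a name must remain before it.
    public static void Split(string streetPart, out string name, out string type)
    {
        string[] tokens = streetPart.Replace(".", "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int last = tokens.Length - 1;

        if (last >= 1 && KnownTypes.Contains(tokens[last]))
        {
            name = string.Join(" ", tokens, 0, last);
            type = tokens[last];
        }
        else
        {
            name = string.Join(" ", tokens);
            type = string.Empty;
        }
    }
}
```

With this rule "Branch Ave." yields the name "Branch" and the type "Ave", while a lone "Branch" is kept as a name with no type. It does not address apartment numbers.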

Upvotes: 2

Adrian Archer

Reputation: 2323

I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.

Upvotes: 7

VictorB

Reputation: 578

I'll agree that your strictness is going to be a problem. I'm writing an address parser designed to strip addresses from classified ads where the format could be just about anything. For instance, for your quadrant matches, you're ignoring punctuation altogether. I have to search data that could represent NE in all these different ways:

"NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast"

so I am using the following pattern match which should catch all direction qualifiers no matter how they are expressed:

\b(?:(?:[nesw]\.? ?){0,2}|(?:north|no\.|east|south|so\.|west){0,2})\b

Of course, context is also important since "no" is going to be matched by this. But "NE" for Nebraska would be matched by either, so you really have to be careful about what's to the left and right in your larger expression. I'm having to compile lists of words that commonly appear interspersed in address texts which are not address components, such as "near, x-street, in, across", etc.
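
Note that with `{0,2}` the pattern above can also succeed on an empty match, so it needs care when embedded in a larger expression. As an alternative sketch, assuming the candidate token has already been isolated, the variants listed above can be normalized without a regex: strip periods and spaces, collapse the spelled-out words, and check the result against the eight valid compact forms. `DirectionNormalizer` and `Normalize` are hypothetical names, and the "no." and "so." abbreviations are not handled here.

```csharp
using System;
using System.Collections.Generic;

public static class DirectionNormalizer
{
    private static readonly HashSet<string> Valid =
        new HashSet<string> { "N", "S", "E", "W", "NE", "NW", "SE", "SW" };

    // Returns the compact form (for example "NE"), or null if the text is not a direction.
    public static string Normalize(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return null;

        // "N. E" -> "NE", "North East" -> "NORTHEAST"
        string s = text.Replace(".", "").Replace(" ", "").ToUpperInvariant();

        // "NORTHEAST" -> "NEAST" -> "NE"
        s = s.Replace("NORTH", "N").Replace("SOUTH", "S")
             .Replace("EAST", "E").Replace("WEST", "W");

        return Valid.Contains(s) ? s : null;
    }
}
```

Words such as "no" or "Nebraska" come back as null, but a bare "NE" is still accepted, so the surrounding context matters just as much as with the regex.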

It is a very tough problem, and I agree Salt Lake City is a bitch. In addition to having the double direction/coordinate format, they also compound it by referring to stuff like "3700 North 5300 East Arborville Way" where the streets can be referenced by name, number, or both.
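
A rough sketch of how the grid-style form might be recognized separately before falling back to a general parser. The group names, the `GridAddress` class, and the assumption that the format is always "number direction coordinate direction [street name]" are illustrative only. Real Salt Lake City data will have more variants than this.

```csharp
using System.Text.RegularExpressions;

public static class GridAddress
{
    // number, direction, coordinate, direction, then an optional trailing street name.
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?<number>\d+)\s+(?<dir1>North|South|East|West|[NSEW])\s+" +
        @"(?<coord>\d+)\s+(?<dir2>North|South|East|West|[NSEW])" +
        @"(?:\s+(?<rest>.+?))?\s*$",
        RegexOptions.IgnoreCase);

    // Returns the Match so callers can check Success and read the named groups.
    public static Match Parse(string line)
    {
        return Pattern.Match(line);
    }
}
```

For "3700 North 5300 East Arborville Way" the `rest` group holds "Arborville Way". For "855 South 1300 East" it is empty.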

Upvotes: 0

Will Hartung

Reputation: 118593

Have fun with addresses and regexes, you're in for a long, horrible ride.

You're trying to lay order upon chaos.

For every "123 Simple Way", there's a "14 1/2 South".

Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".

Have fun with that.

There are more exceptions than rules when it comes to street addresses.

Upvotes: 9

Brian Parker

Reputation:

I tried to get this to work, but it seems as though you have a static member of a StreetTypes class that is not included. It seems to work except for that, but I cannot do much testing without it.
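
For anyone else who wants to test the code, here is a guess at a stand-in for the missing member. The question only shows that `StreetTypes` has a `TryGetValue` taking the matched type and returning an abbreviation, so the dictionary shape, the `StreetTypeLookup` class, and the `Abbreviate` helper below are all assumptions. The entries cover only a few of the types from the subset in the question.

```csharp
using System;
using System.Collections.Generic;

public static class StreetTypeLookup
{
    // Hypothetical replacement for the StreetTypes member that the question omits.
    // Maps each accepted spelling to the abbreviation that should be stored.
    public static readonly Dictionary<string, string> StreetTypes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ALLEY", "ALY" },  { "ALY", "ALY" },
            { "ARCADE", "ARC" }, { "ARC", "ARC" },
            { "AVENUE", "AVE" }, { "AV", "AVE" }, { "AVE", "AVE" },
            { "BAYOU", "BYU" },  { "BYU", "BYU" },
            { "BEACH", "BCH" },
        };

    // Returns the abbreviation when the type is known, otherwise the input unchanged.
    public static string Abbreviate(string streetType)
    {
        string abbreviation;
        return StreetTypes.TryGetValue(streetType, out abbreviation)
            ? abbreviation
            : streetType;
    }
}
```

The case-insensitive comparer makes the `ToUpper()` call in the original optional, though leaving it in does no harm.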

Upvotes: 0

Renaud Bompuis

Reputation: 16776

I think you should clarify your usage scenario.

Unless you're in a very, very limited scenario where you know that the addresses were entered following a strict schema, parsing addresses for content is an extremely hard problem to solve and, usually, quite futile (unless it's the raison d'être of your application).

If you're limited to a particular country that has very specific conventions for writing addresses, then using these regexes might get you 90% of the way.
However, as soon as you have to start accepting foreign addresses, you're screwed.
Even if you're a US-centric site, there is a good chance that you may have to accept addresses from US citizens living abroad, for instance.

Again, it may be OK in a very narrow field, but it's almost always a bad idea to validate or split addresses that were not strictly validated and constrained at the time the user entered them.
When you do enforce some strict rules for users to enter their addresses, these end up being inadequate in a small portion of cases, even in the best address validation components out there.

Just a few things that mess up address parsing:

  • Postal codes (Zip codes) are sometimes placed before, sometimes after, and may not exist at all.
  • Postal codes follow strict rules: a 10-digit Zip code is probably easy to spot as invalid, but what about a non-existent one? What about more complex codes, such as those used in the UK?
  • What about a place like Hong Kong, where you could write the address in either English, Traditional Chinese or Mandarin?
  • What if it's perfectly fine to split your address and write it out of sequence?
  • Even if you're just parsing US addresses, there are at least a handful of ways to describe a PO box. You can also use poste restante or general delivery, and then you need to add a 4-digit code to the Zip code, which would normally not be present at all...

Bottom line is

If getting addresses in a parseable format is really important, be 100% sure that you can get all possible combinations right, or you're going to have a percentage of failures that will mean frustrated users and lost sales.
If you don't have 100% case coverage then don't enforce strict rules on the user.
I can't count the number of websites I gave up purchasing from because they would require a Zip/Postal Code when the place I live in has none.
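
To make the "don't enforce strict rules" point concrete, here is a sketch of a lenient wrapper: it always keeps the line exactly as the user typed it and only marks the result as structured when a deliberately simple pattern happens to match. `LenientAddress`, its properties, and the pattern are all hypothetical. The idea is that a failed parse is recorded, never turned into a validation error.

```csharp
using System.Text.RegularExpressions;

public sealed class LenientAddress
{
    public string RawLine1 { get; private set; }
    public string HouseNumber { get; private set; }
    public string Street { get; private set; }
    public bool IsStructured { get; private set; }

    // Deliberately simple: leading digits, whitespace, then the rest of the line.
    private static readonly Regex Simple =
        new Regex(@"^\s*(?<num>\d+)\s+(?<street>\S.*?)\s*$");

    // Never throws and never rejects: unparsed input is kept as-is in RawLine1.
    public static LenientAddress From(string line1)
    {
        var result = new LenientAddress
        {
            RawLine1 = line1 ?? string.Empty,
            HouseNumber = string.Empty,
            Street = string.Empty
        };

        Match m = Simple.Match(result.RawLine1);
        if (m.Success)
        {
            result.HouseNumber = m.Groups["num"].Value;
            result.Street = m.Groups["street"].Value;
            result.IsStructured = true;
        }
        return result;
    }
}
```

Code that needs the components checks `IsStructured` first. Everything else, such as printing a shipping label, can use `RawLine1` unchanged.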

Sorry for the rant, but I think it's important that people wanting to do address validation and parsing think hard about what they're getting themselves into.

Upvotes: 6
