Ricalsin

Reputation: 940

Vim: Parsing address fields from all around the globe

Intro

This post is long, but I consider it thorough. I hope it will be helpful to others who work with addresses, while also teaching some complex Vim regexes. Thank you for your time.

Worldwide addresses:

Users from the US, Canada and a few other countries are offered 5 fields on a form, whose contents are then displayed in a comma-delimited format that I need to dissect further. Ideally, the comma-separated content looks like:

Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip

where zip can be either a series of just numbers (US) or numbers and letters (Canada).

Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:

Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip

Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:

Foreign Name of Building, A street name, A City, ,zip, Country

The ", ," field is usually empty, as non-US countries are not segmented into states. And, yes, the same "additional commas" problem described above happens here too.

Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country

Parsing Strategy:

A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you work backwards from the last field using this assumption, you should be able to place the country, zip, state (if not the empty ", ,"), city and street into their respective positions, which are the most important fields to get right. Anything beyond those sections could be lumped together in the first one or two lines as a description of the address (building, name, suite, cross streets, etc.). For example:

Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters

  1. Last section has a digit (therefore a US or Canadian address)
  2. There are a total of 6 sections, so that's one more than the original 5
  3. Knowing that sections 5-2 of the original 5 are zip, state, town, address...
  4. 6 minus 5 (original) = 1 extra section, so add an extra Address (Address2) field and leave the first section as the header, resulting in:

Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters

While there might be a discrepancy on where "111 Street" or "Suite 101" goes (Address1 or Address2), this at least gets the zip, state, city and address(es) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
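Steps 1, 2 and 4 above could be sketched in Vim script roughly as follows. ExtraSections is a hypothetical name, and the sketch assumes that sections are separated by a comma plus optional spaces and that a country never contains a digit.

```vim
" Sketch only: ExtraSections() is a made-up helper name.
function! ExtraSections(line)
    " One list item per comma-delimited section.
    let r = split(a:line, ',\s*', 1)
    " Assumption: a last section without a digit is a country, which
    " raises the base section count from 5 to 6.
    let base = r[-1] =~ '\d' ? 5 : 6
    " Number of sections above the base count.
    return len(r) - base
endfunction
```

For instance, `:echo ExtraSections('123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344')` should print 0, since that line has exactly the 5 base sections.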

Under this approach, foreign addresses get parsed like:

Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country

  1. Last section has no digit, so it must be a Country
  2. That means, moving right to left, the second section is the zip
  3. So now (foreign) you have an "original 6 sections" to subtract from the total of 8 in the example
  4. 8th section = country, 7th = zip, 6th = state (mostly blank on foreign addresses), 5th = City, 4th = address1, 3rd and 2nd = address2, 1st = header
  5. We knew to use two address fields because the example had 8 sections and foreign addresses have a base of 6 sections. Any sections above the base are added to the address2 field. If there are 3 sections above the base section count, they are appended to each other inside the address2 field.
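The right-to-left walk for a foreign address could be sketched by popping sections off the end of a list. PeelForeign is a hypothetical name, and the sketch assumes the line is already known to end in a country; the empty state section simply comes out as an empty string.

```vim
" Sketch only: PeelForeign() is a made-up helper name, and the line is
" assumed to end in a country.
function! PeelForeign(line)
    let r = split(a:line, ',\s*', 1)
    " Fewer than the 6 base sections: nothing sensible to assign.
    if len(r) < 6
        return {}
    endif
    let a = {}
    " remove(list, -1) pops the last item, so each call moves one
    " section further to the left.
    let a.country = remove(r, -1)
    let a.zip = remove(r, -1)
    let a.state = remove(r, -1)
    let a.city = remove(r, -1)
    let a.address1 = remove(r, -1)
    let a.header = remove(r, 0)
    " Whatever is left sat between the header and address1.
    let a.address2 = join(r, ', ')
    return a
endfunction
```

Popping from the end keeps the code free of index arithmetic: any number of extra sections ends up in address2 without the count being computed first.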

Coding

In this approach using VIM, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections when I am not sure how many sections exist?
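As a rough sketch of one way to approach both questions: the section count can be read from the length of a split() result, and a regex anchored at the end of the line can capture the last few sections while a greedy first group absorbs however many come before. CountSections and TailFields are hypothetical names, and the separator is assumed to be a comma plus optional spaces.

```vim
" Sketch only: CountSections() and TailFields() are made-up names.
function! CountSections(text)
    " split() returns one list item per comma-delimited section.
    return len(split(a:text, ',\s*', 1))
endfunction

function! TailFields(text)
    " \(.*\) is greedy, so it takes every leading section; the three
    " [^,]* groups are then forced onto the last three sections.
    let pat = '^\(.*\),\s*\([^,]*\),\s*\([^,]*\),\s*\([^,]*\)$'
    let m = matchlist(a:text, pat)
    " matchlist() returns an empty list when there is no match.
    if empty(m)
        return []
    endif
    " m[0] is the whole match; m[1] to m[4] are the submatches.
    return m[1:4]
endfunction
```

With the address in register a, `:echo CountSections(getreg('a'))` would report the number of sections. Note that a linewise yank leaves a trailing newline in the register, which the end-anchored pattern in TailFields would have to allow for.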

Example Addresses

Here are some practice addresses (US and foreign) if you are so inclined to help:

City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984

MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502

SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444

123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344

Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6

Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore

Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia

Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa

Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa

Upvotes: 3

Views: 873

Answers (2)

ib.

Reputation: 28944

The following code is a draft-quality Vim script (hopefully) implementing the address parsing routine described in the question.

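function! ParseAddress(line)
    " Split on each comma plus any spaces after it. The third argument (1)
    " also keeps an empty first or last section.
    let r = split(a:line, ',\s*', 1)
    " A last section without a digit is taken to be a country.
    let hadcountry = r[-1] !~ '\d'
    let a = {}
    let a.country = hadcountry ? r[-1] : ''
    " Drop the country, if any, so that the zip is always last.
    let r = r[:-1-hadcountry]
    let a.zip = r[-1]
    let a.state = r[-2]
    let a.city = r[-3]
    let a.header = r[0]
    let nleft = len(r) - 4
    if hadcountry
        " Foreign: the section next to the city is address 1.
        let a.address1 = r[-4]
        let a.address2 = join(r[1:nleft-1], ', ')
    else
        " US/Canada: the section after the header is address 1.
        let a.address1 = r[1]
        let a.address2 = join(r[2:nleft], ', ')
    endif
    return a
endfunction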

function! FormatAddress(a)
    let t = map([
    \   ['Header', 'header'],
    \   ['Address 1', 'address1'],
    \   ['Address 2', 'address2'],
    \   ['Town', 'city'],
    \   ['State/Province', 'state'],
    \   ['Country', 'country'],
    \   ['Zip', 'zip']],
    \   'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
    \       '? v:val[0] . ": " . a:a[v:val[1]] : ""')
    return join(filter(t, '!empty(v:val)'), '; ')
endfunction

The command below can be used to test the above parsing routines.

:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))

(One can provide a range to the :global command to run it over fewer test address lines.)

Upvotes: 1

Jonathan Leffler

Reputation: 753675

Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately you deal with other addresses.

There are probably other related questions: see the tag street-address for some of them.

Upvotes: 1
