How to efficiently separate parts of an address with varying format?

Question

I have a data set of addresses as strings and I want to separate them into their parts. So far I have used the split() method plus some logic to handle the individual components. This works for very simple examples, but the effort explodes when I want to handle other cases, e.g. when the space between state and zip code is missing.

I have also thought about splitting on commas as the delimiter, but that obviously does not work when there are no commas present.

"1015 Jefferson St, Santa Clara, CA 95050, USA"
"1015 Jefferson St, Santa Clara, CA 95050"
"1015 Jefferson St Santa Clara CA 95050"
"Santa Clara, CA95050"

Is there an efficient way to parse these addresses? The examples above cover pretty much all the cases. Also, I would be fine with not separating street and city for now, and all addresses are in the US, so the USA bit can be ignored.

brawden · Accepted Answer

I think what you are looking for is regular expressions. They are a powerful tool for matching patterns in strings and are available in many programming languages.

The following code should work for your purpose. To test and modify regular expressions, this site offers a great test bed.

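import re

source_string = "1015 Jefferson St, Santa Clara, CA 95050, USA"

# Lazily capture street and city, then the two-letter state and the five-digit zip code.
result = re.search(r"(.*?),?\s?([A-Z]{2})\s?([0-9]{5})", source_string)
if result is None:
    # re.search returns None when the string contains no state and zip code.
    raise ValueError("no state and zip code found in: " + source_string)

street_city = result.group(1)
state = result.group(2)
zip_code = result.group(3)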

Result:

street_city = 1015 Jefferson St, Santa Clara
state = CA
zip_code = 95050

Explanation:

[A-Z]{2} matches exactly two upper case letters (state).
[0-9]{5} matches exactly five digits (zip code).
Between those, there may or may not be a space (\s?) but nothing else.
Before the state, there may or may not be a comma (,?) and a space (\s?), but we don't want them to be part of the street and city result string.
Everything before this is taken as the street and city string. Since we don't want it to contain the trailing comma and space, we tell it to match 'lazy' by using the ? after the *: .*?
By grouping with normal parentheses, we get groups which we can later index to get only parts of the total matched string.
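
To check the pattern against all four example addresses from the question, it can be wrapped in a small helper. This is only a sketch: the names ADDRESS_RE and parse_address are made up for illustration, and returning None for strings without a state and zip code is an assumption about how you want to handle unparseable input.

```python
import re

# Same pattern as above, compiled once so it can be reused for many addresses.
ADDRESS_RE = re.compile(r"(.*?),?\s?([A-Z]{2})\s?([0-9]{5})")


def parse_address(address):
    """Return (street_city, state, zip_code), or None if nothing matches.

    parse_address is a hypothetical helper name, not part of the re module.
    """
    match = ADDRESS_RE.search(address)
    if match is None:
        return None
    return match.groups()


examples = [
    "1015 Jefferson St, Santa Clara, CA 95050, USA",
    "1015 Jefferson St, Santa Clara, CA 95050",
    "1015 Jefferson St Santa Clara CA 95050",
    "Santa Clara, CA95050",
]

for address in examples:
    print(parse_address(address))
```

Each example yields CA as the state and 95050 as the zip code. The first group keeps whatever punctuation the input had between street and city, so you get "1015 Jefferson St, Santa Clara" for the first two, "1015 Jefferson St Santa Clara" for the third and "Santa Clara" for the last.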
