Reputation: 1189
I have a data set of addresses as strings and I want to separate them into their parts. So far I have used the split() method and then some logic to handle the single components. This works for very simple examples, but the effort explodes when I want to handle other cases, e.g. when the space between state and zip code is missing.
I have also thought about splitting on commas as delimiters, but that obviously does not work when there are no commas present.
"1015 Jefferson St, Santa Clara, CA 95050, USA"
"1015 Jefferson St, Santa Clara, CA 95050"
"1015 Jefferson St Santa Clara CA 95050"
"Santa Clara, CA95050"
Is there an efficient way to parse these addresses? The examples above cover pretty much all the different cases. Also, I would be fine with not separating street and city for now, and all addresses are in the US, so the USA bit can be ignored.
Upvotes: 1
Views: 372
Reputation: 36
I think what you are looking for is regular expressions. They are a powerful tool for matching patterns in strings and are available in many programming languages.
The following code should work for your purpose. To test and modify regular expressions, this site offers a great test bed.
import re
source_string = "1015 Jefferson St, Santa Clara, CA 95050, USA"
result = re.search(r"(.*?),?\s?([A-Z]{2})\s?([0-9]{5})", source_string)
street_city = result.group(1)
state = result.group(2)
zip_code = result.group(3)
Result:
street_city = 1015 Jefferson St, Santa Clara
state = CA
zip_code = 95050
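To check the pattern against all four examples from the question, it can be wrapped in a small helper. This is only a sketch: the name parse_address and the choice to return None when nothing matches are assumptions made for illustration, not part of the code above.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused.
ADDRESS_RE = re.compile(r"(.*?),?\s?([A-Z]{2})\s?([0-9]{5})")


def parse_address(address):
    """Return (street_city, state, zip_code), or None if the pattern does not match.

    parse_address is a hypothetical helper name used only for this example.
    """
    match = ADDRESS_RE.search(address)
    if match is None:
        # re.search returns None on failure; calling .group() on it would raise.
        return None
    return match.groups()


samples = [
    "1015 Jefferson St, Santa Clara, CA 95050, USA",
    "1015 Jefferson St, Santa Clara, CA 95050",
    "1015 Jefferson St Santa Clara CA 95050",
    "Santa Clara, CA95050",
]

for sample in samples:
    print(parse_address(sample))
```

Checking for None matters because re.search returns None for a string without a two-letter state followed by a five-digit zip code, and result.group(1) would then fail with an AttributeError.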
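Explanation:
[A-Z]{2} matches exactly two upper-case letters (state).
[0-9]{5} matches exactly five digits (zip code).
\s? between the two allows an optional whitespace character between state and zip code, but nothing else.
,? and \s? before the state: there can be a comma (,?) and a space (\s?), but we don't want them to be part of the street and city result string.
(.*?) captures everything before that (street and city). To keep the optional comma and space out of this group, it is made non-greedy by the ? after the *: .*?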
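The effect of the ? after the * can be seen by comparing the pattern with a greedy variant. The following is a small sketch; the variable names lazy and greedy are chosen here for illustration only.

```python
import re

address = "1015 Jefferson St, Santa Clara, CA 95050"

# Lazy first group (.*?): stops as early as possible, so the optional
# comma and space are consumed by ,? and \s? outside the group.
lazy = re.search(r"(.*?),?\s?([A-Z]{2})\s?([0-9]{5})", address)

# Greedy first group (.*): grabs as much as possible, so the comma and
# space in front of the state end up inside the group.
greedy = re.search(r"(.*),?\s?([A-Z]{2})\s?([0-9]{5})", address)

print(repr(lazy.group(1)))    # '1015 Jefferson St, Santa Clara'
print(repr(greedy.group(1)))  # '1015 Jefferson St, Santa Clara, '
```

State and zip code come out the same in both cases; only the first group differs, by the trailing comma and space.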
Upvotes: 2