Regular expression in python to capture multiple forms of badly formatted addresses

Question

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.

I am new to Python and regular expressions, and have gotten great feedback here on Stack Overflow. With my new knowledge I built a regex that is getting close to the final result, but I still can't spot the problem.

import re

r1 = r"([\w\s+]+),?\s*\(?([\w\s+\/]+)\)?\s*\(?([\w\s+\/]+)\)?"

match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')

group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()

print(group1)
print(group2)
print(group3)

This should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3. Instead, it returns:

('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')

The only perfect match is group 3. The other 2 are isolating the 'a' at the end, and the 2nd one has an extra space at the end of 'caracas '. Thanks in advance for any insight.

Cheers!

machine yearning · Accepted Answer

Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?

Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):

([\w\s+]+)

This says, "capture one or more (word character, space, or literal plus sign)"

Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the + inside your brackets [ ]: inside a character class it is a literal plus sign rather than the 1-or-more symbol, and your regex is already matching one or more characters thanks to the outer +. I'd rewrite this part like this:

([\w\s]*\w)

This will match greedily up to the last word character ("zero or more (letter or space) followed by a letter"), so trailing spaces are never captured. It does assume you have at least one character, but that is better than your original pattern, which would accept a single space on its own.
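A quick way to see the difference between the two first-group patterns is to run both against the second example. This is only a sketch, and the variable names are made up for illustration. It also shows that a + inside square brackets behaves as a literal plus sign.

```python
import re

original = re.compile(r"([\w\s+]+)")   # the first group from the question
trimmed = re.compile(r"([\w\s]*\w)")   # the suggested rewrite

text = "caracas (venezuela)"

# The original class also swallows the space before the paren.
original_city = original.match(text).group(1)

# The rewrite must end on a word character, so the space is left out.
trimmed_city = trimmed.match(text).group(1)

# Inside [...] the + is a literal plus sign, so "a+b" matches in full.
plus_match = original.match("a+b").group(1)

print(repr(original_city), repr(trimmed_city), repr(plus_match))
```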

Next you have:

,?\s*\(?

which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:

(?:,\s*\(|,\s*|\s*\()

which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
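To check which separators that alternation accepts, it can be tested on its own. The sketch below assumes Python 3.4 or later (for re.fullmatch), and the sample strings are just illustrative.

```python
import re

# The suggested separator: comma+paren, comma alone, or paren alone.
separator = re.compile(r"(?:,\s*\(|,\s*|\s*\()")

samples = [", (", ", ", " (", ",", "(", " ", ""]

# fullmatch only succeeds when the whole sample is a valid separator.
results = {s: bool(separator.fullmatch(s)) for s in samples}

for sample, ok in results.items():
    print(repr(sample), ok)
```

Whitespace alone and the empty string are rejected, which is the "must have a comma or a paren or both" guarantee described above.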

Next you have the capturing expression, very similar to the first:

([\w\s+\/]+)

Again, you don't want the spaces (or slashes in this case) at the end of the country name, and you don't want the + inside the [ ]:

([\w\s\/]*\w)

The next expression is probably where you're getting your 'venezuel', 'a' problem; let's take a look:

\)?\s*\(?([\w\s+\/]+)\)?

This is a rather long one, so let's break it down:

\)?\s*\(?

says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:

([\w\s+\/]+)

This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, the previous group will greedily match all of it, and then it has to backtrack and give up the final a so that this last expression has something left to match. Try instead:
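The backtracking can be reproduced with a stripped-down pair of patterns. These are hypothetical patterns (named mandatory and optional here), not the full address regex; they only isolate the effect of a required trailing group.

```python
import re

# Two greedy groups in a row; the second one is mandatory.
mandatory = re.compile(r"([\w\s]+)([\w\s]+)$")
# Same thing, but the second group is allowed to be absent.
optional = re.compile(r"([\w\s]+)([\w\s]+)?$")

forced = mandatory.match("venezuela").groups()
relaxed = optional.match("venezuela").groups()

print(forced)   # the first group gives back one character to the second
print(relaxed)  # the second group did not participate, so it is None
```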

\)?\s*

Followed by making your entire final expression optional, and the outer expression non-capturing:

(?:\(?([\w\s+\/]+)\)?)?
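Taken in isolation, the optional non-capturing wrapper behaves roughly as sketched below. The name tail is made up for this example; the point is that only the inner parentheses produce a group, and that group is None when the trailing part is missing.

```python
import re

# Optional, non-capturing wrapper around the single capturing group.
tail = re.compile(r"(?:\(?([\w\s+\/]+)\)?)?")

with_tail = tail.match("(df)").groups()
without_tail = tail.match("").groups()

print(with_tail, without_tail)
```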

The final expression would be:

([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\/]*\w)\)?\s*(?:\(?([\w\s+\/]+)\)?)?

Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.

Testing it on your examples:

>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
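If the goal is the comma-separated strings described in the question, the groups can be joined after matching. Here is a minimal sketch, assuming the final pattern above; ADDRESS and normalize_address are hypothetical names introduced only for this example.

```python
import re

# Final pattern from the answer, compiled once.
ADDRESS = re.compile(
    r"([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\/]*\w)\)?\s*(?:\(?([\w\s+\/]+)\)?)?"
)

def normalize_address(raw):
    """Return 'city, country[, region]', or None if raw does not match."""
    match = ADDRESS.match(raw)
    if match is None:
        return None
    # Drop the third group when it did not participate in the match.
    return ", ".join(part for part in match.groups() if part is not None)

examples = ("caracas, venezuela", "caracas (venezuela)", "caracas, (venezuela) (df)")
for raw in examples:
    print(normalize_address(raw))
```

Input with no comma or paren at all, such as a bare city name, does not match the pattern, so the helper returns None for it rather than guessing.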
