Reputation: 167
I'm trying to filter street names and get the parts that I want. The names come in several formats. Here are some examples and what I want from them.
Car Cycle 5 B Ap 1233 < what I have
Car Cycle 5 B < what I want
Potato street 13 1 AB < what I have
Potato street 13 < what I want
Chrome Safari 41 Ap 765 < what I have
Chrome Safari 41 < what I want
Highstreet 53 Ap 2632/BH < what I have
Highstreet 53 < what I want
Something street 91/Daniel < what I have
Something street 91 < what I want
Usually what I want is the street name (1-4 names) followed by the street number if there is one and then the street letter (1 letter) if there is one. I just can't get it to work right.
Here is my code (I know, it sucks):
import re

def address_regex(address):
    # Patterns are tried from most specific to least specific.
    regex1 = re.compile(r"(\w+ ){1,4}(\d{1,4} ){1}(\w{1} )")
    regex2 = re.compile(r"(\w+ ){1,4}(\d{1,4} ){1}")
    regex3 = re.compile(r"(\w+ ){1,4}(\d){1,4}")
    regex4 = re.compile(r"(\w+ ){1,4}(\w+)")
    # Search the function argument (the original searched an undefined name, text).
    s1 = regex1.search(address)
    s2 = regex2.search(address)
    s3 = regex3.search(address)
    s4 = regex4.search(address)
    regex_address = ""
    if s1 is not None:
        regex_address = s1.group()
    elif s2 is not None:
        regex_address = s2.group()
    elif s3 is not None:
        regex_address = s3.group()
    elif s4 is not None:
        regex_address = s4.group()
    else:
        # No pattern matched: return the input unchanged.
        regex_address = address
    return regex_address
I'm using Python 3.4
Upvotes: 3
Views: 84
Reputation: 5268
I'm going to go out on a limb here and assume in your last example you actually want to catch the number 91, because it makes no sense not to.
Here's a solution which catches all your examples (and your last, but including the 91):
^([\p{L} ]+ \d{1,4}(?: ?[A-Za-z])?\b)

^
Start match at beginning of string
[\p{L} ]+
Character class of space or Unicode character belonging to the "letter" category, 1-infinity times
\d{1,4}
Number, 1-4 times
(?: ?[A-Za-z])?
Non-capture group of optional space and a single letter, 0-1 times

Capture group 1 is the entire address. I didn't quite understand the logic behind your grouping, but feel free to group it however you prefer.
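Since the question uses Python's built-in re module, which as far as I know does not accept \p{L} (the third-party regex module does), here is a minimal sketch of the same pattern adapted for re. It assumes [^\W\d_] is an acceptable stand-in for \p{L}; the names ADDRESS_RE and trim_address are made up for illustration, and returning the input unchanged on a failed match simply mirrors the fallback in the question's code.

```python
import re

# Assumption: [^\W\d_] ("word character that is neither digit nor underscore")
# stands in for \p{L}, which the standard re module does not support.
ADDRESS_RE = re.compile(r"^((?:[^\W\d_]| )+ \d{1,4}(?: ?[A-Za-z])?\b)")


def trim_address(address):
    """Return name, number and optional letter, or the input if nothing matches."""
    match = ADDRESS_RE.search(address)
    return match.group(1) if match else address


samples = [
    "Car Cycle 5 B Ap 1233",
    "Potato street 13 1 AB",
    "Chrome Safari 41 Ap 765",
    "Highstreet 53 Ap 2632/BH",
    "Something street 91/Daniel",
]
for sample in samples:
    print(trim_address(sample))
```

The trailing \b is what rejects "Ap" as a street letter: after the "A" comes another word character, so the optional letter group is dropped and the match ends at the number.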
Upvotes: 3
Reputation: 14089
This works for the 5 samples you provided
^([a-z]+\s+)*(\d*(?=\s))?(\s+[a-z])*\b
Turn on multiline mode and case insensitivity. That's (?im) if your regex flavor supports inline flags.
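In Python the two modes can also be passed as flags instead of the inline (?im) prefix. The following is only a sketch of how that might look; the names PATTERN and trim_address_flags are hypothetical, and the strip() call is my addition, because the match can end in a trailing space when no street number is captured (as in the last sample).

```python
import re

# Same pattern as above, with the modes given as flags rather than (?im).
PATTERN = re.compile(
    r"^([a-z]+\s+)*(\d*(?=\s))?(\s+[a-z])*\b",
    re.IGNORECASE | re.MULTILINE,
)


def trim_address_flags(address):
    """Return the matched prefix without surrounding whitespace."""
    match = PATTERN.search(address)
    # Assumption: fall back to the input when nothing matches at all.
    return match.group().strip() if match else address


for sample in ["Car Cycle 5 B Ap 1233", "Something street 91/Daniel"]:
    print(trim_address_flags(sample))
```

Note that the lookahead (?=\s) requires whitespace right after the number, so "91/Daniel" contributes nothing and the last sample comes back without its number.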
Upvotes: 0
Reputation: 17877
Maybe you'd prefer a more readable Python version (no regex):
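import string

names = [
    "Car Cycle 5 B Ap 1233",
    "Potato street 13 1 AB",
    "Chrome Safari 41 Ap 765",
    "Highstreet 53 Ap 2632/BH",
    "Something street 91/Daniel",
]

for name in names:
    result = []
    words = name.split()
    # Street name: take leading words made up only of ASCII letters.
    while any(words) and all(c in string.ascii_letters for c in words[0]):
        result += [words[0]]
        words = words[1:]
    # Street number: one word made up only of digits.
    if any(words) and all(c in string.digits for c in words[0]):
        result += [words[0]]
        words = words[1:]
    # Street letter: exactly one uppercase letter (the length check stops
    # substrings such as "AB" from passing the "in" test).
    if any(words) and len(words[0]) == 1 and words[0] in string.ascii_uppercase:
        result += [words[0]]
        words = words[1:]
    print(" ".join(result))  # print() call, since the question uses Python 3.4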
Output:
Car Cycle 5 B
Potato street 13
Chrome Safari 41
Highstreet 53
Something street
Upvotes: 0