Reputation: 2179
I have written this function:
import re

def my_func(s):
    wordlist = ('unit', 'room', 'lot')
    # Proceed only if s starts with one of the words and contains a digit.
    if any(re.match(r'^' + word + r'\b' + r'.*$', s.lower()) for word in wordlist) and any(i.isdigit() for i in s.lower()):
        if ',' in s:
            out = re.findall(r"(.*),", s)  # Getting everything before comma
            return out[0]
        else:
            out = re.findall(r"([^\s]*\s[^\s]*)", s)  # Getting everything before second space.
            return out[0]
My test data and the expected output:
Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> None
My logic here is
Everything else works fine. How can I capture Lot 12 here? That is, if the string matches the wordlist and there is no ',' and no second space, return the whole string.
Upvotes: 0
Views: 664
Reputation: 25799
You're overcomplicating this; it's a simple word + whitespace + digits match:
import re

def my_func(s):
    wordlist = ('unit', 'room', 'lot')
    result = re.match(r"((?:{})\s+\d+)".format("|".join(wordlist)), s, re.IGNORECASE)
    if result:
        return result.group()
Let's test it:
test_data = ["Unity 11 Lane.",
             "Unit 11 queen street",
             "Unit 7, king street",
             "Lot 12",
             "Unit street"]

for entry in test_data:
    print("{} --> {}".format(entry, my_func(entry)))
Which gives:
Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> None
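As a possible variation (a sketch, not part of the function above), the pattern could be compiled once and each keyword passed through re.escape, so that a keyword containing regex metacharacters would still be matched literally. The names WORDLIST, UNIT_PATTERN and extract_unit are hypothetical, chosen only for this illustration:

```python
import re

# Hypothetical module-level names, used only for this sketch.
WORDLIST = ("unit", "room", "lot")

# Build the alternation once; re.escape guards against keywords that
# contain regex metacharacters (none of the current ones do).
UNIT_PATTERN = re.compile(
    r"(?:{})\s+\d+".format("|".join(re.escape(w) for w in WORDLIST)),
    re.IGNORECASE,
)


def extract_unit(s):
    """Return the leading 'keyword + whitespace + digits' part of s, or None."""
    result = UNIT_PATTERN.match(s)
    return result.group() if result else None


for entry in ["Unity 11 Lane.", "Lot 12", "Unit street"]:
    print("{} --> {}".format(entry, extract_unit(entry)))
```

The behaviour on the test data should be the same as before; the only difference is that the pattern is built once rather than on every call.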
If you really want to match everything before a whitespace, a comma or EOL, you can do it by replacing the regex with:
result = re.match(r"((?:{})\s+.+?(?=\s|,|$))".format("|".join(wordlist)), s, re.IGNORECASE)
But this will also match one of your undesired strings, because the pattern cannot tell the tokens you want from the ones you don't, such as street:
Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> Unit street
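If the looser "everything up to whitespace, comma or EOL" match is wanted but Unit street should still be rejected, one option is to add a lookahead that requires the token after the keyword to contain a digit, mirroring the isdigit check in the question. This is only a sketch under that assumption; the function name my_func_with_digit is hypothetical:

```python
import re


def my_func_with_digit(s, wordlist=("unit", "room", "lot")):
    # keyword, whitespace, then a single token (no whitespace or comma);
    # the lookahead (?=[^\s,]*\d) requires that token to contain a digit.
    pattern = r"(?:{})\s+(?=[^\s,]*\d)[^\s,]+".format("|".join(wordlist))
    result = re.match(pattern, s, re.IGNORECASE)
    return result.group() if result else None


test_data = ["Unity 11 Lane.",
             "Unit 11 queen street",
             "Unit 7, king street",
             "Lot 12",
             "Unit street"]

for entry in test_data:
    print("{} --> {}".format(entry, my_func_with_digit(entry)))
```

Unlike the digits-only pattern, this would also accept mixed tokens such as 7A, which may or may not be desirable for your data.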
Upvotes: 1