Reputation: 4077
I have several strings from which I need to extract block numbers. The block numbers come in formats such as "3rd block", "pine block", "block 2" and "block no 4". Note that these are only the formats; the actual numbers can change. I have combined them with OR conditions in a single regex.
The problem is that the regex sometimes extracts the word before "block" even though that word belongs to something else: for "main phase block 2", I need "block 2" to be extracted. Using re.search returns only the first match, and the "OR" alternation has its own limitations.
What I want is to add exceptions or conditions to my regex, something like:
If 1 or 2 digits (like 23, 3, 6, 7, etc.) occur before the word "block", extract "block" with the word following "block".
E.g.:
string = "rmv clusters phase 2 block 1 , flat no 209 dev." # extract "block 1" and not "2 block".
If the words "phase", "apartment" or "building" come before "block", extract the word that follows "block" (irrespective of whether it is a number or a word).
E.g.:
string2 = "sky line apartments block 2 chandra layout" # extract "block 2" and not "apartments block"
Here is what I have so far, but I have no idea how to add such conditions.
p = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
q = p.search(str)
This is part of a larger function.
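To make the failure concrete, here is a minimal reproduction, offered as a sketch. The pattern is the one above; the names text and text2 are only for this demo. It also shows that the first alternative cannot match "block 1" at all, because [^a-z] consumes the space and \s then has nothing left to match.

```python
import re

# Pattern copied from the question; `text` and `text2` are demo names only.
p = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')

text = "rmv clusters phase 2 block 1 , flat no 209 dev."
text2 = "sky line apartments block 2 chandra layout"

# search() returns the leftmost match, which here comes from the second
# alternative (\w+\sblock[^a-z]) and includes the preceding word.
print(repr(p.search(text).group(0)))   # → '2 block '
print(repr(p.search(text2).group(0)))  # → 'apartments block '
```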
Upvotes: 0
Views: 143
Reputation: 395085
Tested on Python 2.7 and 3.3.
import re
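# Adjacent string literals are concatenated into one string;
# the trailing spaces keep the last and first words apart.
strings = ("rmv clusters phase 2 block 1 , flat no 209 dev. "
           "sky line apartments block 2 chandra layout "
           "foo bar 99 block baz") # tests rule 1.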
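Here are the rules you stated you wanted:
Rule 1: if 1 or 2 digits occur before the word "block", extract "block" with the word that follows it.
Rule 2: if the words "phase", "apartment" or "building" come before "block", extract the word that follows "block" (whether it is a number or a word).
So: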
regex = re.compile(r'''
    (?:\d{1,2}\s)(block\s\w*)                       # rule 1
    |                                               # or
    (?:(phase|apartment|building).*?)(block\s\w+)   # rule 2
    ''', re.X)
found = regex.finditer(strings)
for i in found:
    print(i.groups())
prints:
(None, 'phase', 'block 1')
(None, 'apartment', 'block 2')
('block baz', None, None)
None is the default for a group that did not take part in the match, so you can pick a preference and let the short-circuiting "or" return the first group if it is non-empty, or the other group if the first is empty (i.e. evaluates as False in Python's boolean contexts).
>>> found = regex.finditer(strings)
>>> for i in found:
...     print(i.group(1) or i.group(3))
...
block 1
block 2
block baz
So to put this thing into a simple function:
def block(text):  # "text" rather than "str", to avoid shadowing the builtin
    regex = re.compile(r'''
        (?:\d{1,2}\s)(block\s\w*)                       # rule 1
        |                                               # or
        (?:(phase|apartment|building).*?)(block\s\w+)   # rule 2
        ''', re.X)
    match = regex.search(text)
    if not match:
        return ''
    else:
        return match.group(1) or match.group(3) or ''
usage:
>>> block("foo bar 99 block baz")
'block baz'
>>> block("sky line apartments block 2 chandra layout")
'block 2'
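One caveat worth checking, shown here as a sketch: the function only fires when one of the two rules applies, so the bare formats listed in the question ("block 2", "block no 4", "3rd block") come back as an empty string. The pattern is repeated so the snippet runs on its own; the samples list is made up for this check.

```python
import re

# Same pattern as in the function above, repeated so this runs standalone.
regex = re.compile(r'''
    (?:\d{1,2}\s)(block\s\w*)                       # rule 1
    |                                               # or
    (?:(phase|apartment|building).*?)(block\s\w+)   # rule 2
    ''', re.X)

def block(text):
    match = regex.search(text)
    if not match:
        return ''
    return match.group(1) or match.group(3) or ''

# Hypothetical inputs: one that triggers rule 2, three that trigger neither rule.
samples = [
    "rmv clusters phase 2 block 1 , flat no 209 dev.",
    "block 2 chandra layout",
    "block no 4",
    "3rd block",
]
for s in samples:
    print(repr(block(s)))
# → 'block 1'
# → ''
# → ''
# → ''
```

If those bare formats also need to be extracted, a fallback pattern would have to be tried when this function returns ''.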
Upvotes: 1
Reputation: 20456
>>> import re
>>> string = "rmv clusters phase 2 block 1 , flat no 209 dev."
>>> string2 = "sky line apartments block 2 chandra layout"
>>> print(re.findall(r'block\s+\d+', string))
['block 1']
>>> print(re.findall(r'block\s+\d+', string2))
['block 2']
Upvotes: 1
Reputation: 7159
Why don't you write multiple regexes? See the following snippet in Python 3:
import re

def getBlockMatch(string):
    # Prefer the unambiguous "block <number>" form.
    p1Regex = re.compile(r'block\s+\d+')
    # Otherwise fall back to the broader pattern from the question.
    p2Regex = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
    if p1Regex.search(string) is not None:
        return p1Regex.findall(string)
    else:
        return p2Regex.findall(string)
string = "rmv clusters phase 2 block 1 , flat no 209 dev."
print(getBlockMatch(string))
string = "sky line apartments block 2 chandra layout"
print(getBlockMatch(string))
Outputs:
['block 1']
['block 2']
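If more formats need handling, the same idea can be sketched as an ordered list of patterns that are tried from most to least specific. The names PATTERNS and first_block_match, and the individual patterns, are my assumptions for illustration; adjust them to the formats you actually see.

```python
import re

# Hypothetical priority list: the most specific pattern comes first.
PATTERNS = [
    re.compile(r'block\s+no\s+\d+'),  # e.g. "block no 4"
    re.compile(r'block\s+\d+'),       # e.g. "block 2"
    re.compile(r'\w+\s+block\b'),     # e.g. "3rd block", "pine block"
]

def first_block_match(text):
    """Return the match of the first pattern that hits, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None

print(first_block_match("rmv clusters phase 2 block 1 , flat no 209 dev."))  # → block 1
print(first_block_match("flat 12 block no 4"))                               # → block no 4
print(first_block_match("3rd block main road"))                              # → 3rd block
```

Because the word-before-block pattern is tried last, "phase 2 block 1" yields "block 1" rather than "2 block".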
Upvotes: 1