Reputation: 1616
I am trying to extract locations using regex in Python. Right now I am doing this:
def get_location(s):
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    location_pattern = "(?P<location>((?P<place>{keywords}\s[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL | re.VERBOSE)
    for match in location_regex.finditer(s):
        match_str = match.group(0)
        indices = match.span(0)
        print ("Match", match)
        match_str = match.group(0)
        indices = match.span(0)
        print (match_str)

get_location("Im at building 3")
I have three issues:
I am not able to use captures = match.capturesdict() to extract the captures, although this works in other examples.
When I use just location_pattern = 'at|outside\s\w+', it seems to be working. Can someone explain what I am doing wrong?
Upvotes: 3
Views: 974
Reputation: 626806
The main problem here is that you need to place {keywords} inside a non-capturing group: (?:{keywords}). Here is a schematic example: a|b|c\s+\w+ matches either a, or b, or c followed by one or more whitespace characters and one or more word characters. When you put the alternation list into a group, (a|b|c)\s+\w+, it matches either a, or b, or c, and only then does it try to match whitespace and then word characters.
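To make the precedence rule above concrete, here is a minimal sketch using the standard re module. The sample string is made up for illustration and is not from the original question.

```python
import re

# Made-up sample text: each letter is followed by a space and a word character.
text = "a x, b y, c z"

# Without a group, \s+\w+ belongs only to the last alternative, c.
ungrouped = re.findall(r"a|b|c\s+\w+", text)

# With a non-capturing group, \s+\w+ follows whichever alternative matched.
grouped = re.findall(r"(?:a|b|c)\s+\w+", text)

print(ungrouped)  # ['a', 'b', 'c z']
print(grouped)    # ['a x', 'b y', 'c z']
```

The alternation operator has the lowest precedence in a regex, so everything to the right of the last | stays attached to the last alternative unless a group limits its scope.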
See the updated code (a demo online):
import regex as re  # third-party "regex" module; it provides capturesdict()

def get_location(s):
    STRIP_CHARS = '*'
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    # (?:...) keeps the alternation together, so \s+[A-Za-z]+ applies to every keyword
    location_pattern = r"(?P<location>((?P<place>(?:{keywords})\s+[A-Za-z]+)))".format(keywords=keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.UNICODE)
    for match in location_regex.finditer(s):
        match_str = match.group(0)   # the full matched text
        indices = match.span(0)      # (start, end) offsets of the match
        print("Match", match)
        print(match_str)
        captures = match.capturesdict()  # every capture of each named group
        print(captures)

get_location("Im at building 3")
Output:
('Match', <regex.Match object; span=(3, 14), match='at building'>)
at building
{'place': ['at building'], 'location': ['at building']}
Note that location_pattern = 'at|outside\s\w+' is NOT working, since at is matched everywhere, while outside must be followed by a whitespace character and word characters. You may fix it the same way: (at|outside)\s\w+.
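A short sketch of what "matched everywhere" means in practice, using the standard re module. The second sample sentence is invented for illustration; the first is the input from the question.

```python
import re

loose_pattern = r"at|outside\s\w+"      # the pattern from the question
fixed_pattern = r"(?:at|outside)\s\w+"  # the same alternation, grouped

# Invented sample: "Later" happens to contain the letters "at".
sample = "Later, outside gate"
loose_sample = re.findall(loose_pattern, sample)
fixed_sample = re.findall(fixed_pattern, sample)
print(loose_sample)  # ['at', 'outside gate']
print(fixed_sample)  # ['outside gate']

# The input from the question: the loose pattern stops right after "at".
question_input = "Im at building 3"
loose_question = re.findall(loose_pattern, question_input)
fixed_question = re.findall(fixed_pattern, question_input)
print(loose_question)  # ['at']
print(fixed_question)  # ['at building']
```

A non-capturing group is used here so that re.findall returns the whole match. With a capturing group such as (at|outside), findall would return only the group contents.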
If you put the keywords into a group, captures = match.capturesdict() will work well (see the output above).
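Note that capturesdict() is a method of the third-party regex module's match objects. The standard re module does not have it. For readers limited to the standard library, the closest equivalent is groupdict(), which returns one string per named group rather than a list of all captures. The following is only a sketch under that assumption, and get_location_groups is a hypothetical helper name, not part of the original code.

```python
import re

def get_location_groups(s, keywords="at|outside|near"):
    """Hypothetical helper: return one groupdict() per match, using stdlib re."""
    pattern = r"(?P<location>(?P<place>(?:{keywords})\s+[A-Za-z]+))".format(keywords=keywords)
    location_regex = re.compile(pattern, re.IGNORECASE)
    # groupdict() maps each named group to the text of its last capture
    return [m.groupdict() for m in location_regex.finditer(s)]

result = get_location_groups("Im at building 3")
print(result)
```

For this pattern each group participates once per match, so the single-string form from groupdict() carries the same information as the one-element lists shown in the output above.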
Upvotes: 1