silent_dev

Reputation: 1616

How to extract words following a list of keywords in Python using regex?

I am trying to extract locations using Regex in python. Right now I am doing this:

def get_location(s):
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    location_pattern = "(?P<location>((?P<place>{keywords}\s[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL | re.VERBOSE)

    for match in location_regex.finditer(s):
        match_str = match.group(0)
        indices = match.span(0)
        print ("Match", match)
        match_str = match.group(0)
        indices = match.span(0)
        print (match_str)

get_location("Im at building 3")

I have three issues:

  1. It outputs only "at", but it should also output "building".
  2. I cannot use captures = match.capturesdict() to extract the captures, even though it works in other examples.
  3. When I use just location_pattern = 'at|outside\s\w+', it seems to work. Can someone explain what I am doing wrong?

Upvotes: 3

Views: 974

Answers (1)

Wiktor Stribiżew

Reputation: 626806

The main problem here is that you need to place {keywords} inside a non-capturing group: (?:{keywords}). Here is a schematic example: a|b|c\s+\w+ matches either a, or b, or c + <whitespace(s)> + <word char(s)>. When you put the alternation list into a group, (a|b|c)\s+\w+, it matches either a, or b, or c, and only then tries to match whitespace and then word chars.
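To make the schematic example concrete, here is a minimal sketch using the standard-library re module. The sample string and the variable names are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical sample text, chosen only to illustrate the difference.
text = "a b c dog"

# Ungrouped: the alternation splits the whole pattern into three branches,
# "a", "b", and "c\s+\w+", so only the last branch requires a following word.
ungrouped = re.compile(r"a|b|c\s+\w+")

# Grouped: the alternation is confined to (?:...), so every alternative
# must be followed by whitespace and word characters.
grouped = re.compile(r"(?:a|b|c)\s+\w+")

ungrouped_matches = ungrouped.findall(text)
grouped_matches = grouped.findall(text)

print(ungrouped_matches)  # ['a', 'b', 'c dog']
print(grouped_matches)    # ['a b', 'c dog']
```

In the ungrouped case, a and b match on their own; in the grouped case, each keyword brings the following word along with it.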

See the updated code:

import regex as re
def get_location(s):
    STRIP_CHARS = '*'
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    # (?:...) confines the alternation so that \s+[A-Za-z]+ applies to every keyword
    location_pattern = r"(?P<location>((?P<place>(?:{keywords})\s+[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.UNICODE)

    for match in location_regex.finditer(s):
        print ("Match", match)
        match_str = match.group(0)
        indices = match.span(0)
        print (match_str)
        # capturesdict() comes from the third-party regex module, not the stdlib re
        captures = match.capturesdict()
        print(captures)

get_location("Im at building 3")

Output:

('Match', <regex.Match object; span=(3, 14), match='at building'>)
at building
{'place': ['at building'], 'location': ['at building']}

Note that location_pattern = 'at|outside\s\w+' is NOT working, since at is matched everywhere on its own, and only outside must be followed by a whitespace and word chars. You may fix it the same way: (at|outside)\s\w+.
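A quick way to check this is to run both patterns over the same input. This is only a sketch with the standard-library re module; the sample sentence and variable names are assumptions made for illustration, and a non-capturing group is used so that findall returns the whole match.

```python
import re

# Hypothetical sample sentence containing both keywords.
sentence = "Im at building 3, outside gate 4"

# The pattern from the question: "at" is a complete branch by itself.
broken = re.compile(r"at|outside\s\w+")

# The fixed pattern: either keyword must be followed by whitespace and a word.
fixed = re.compile(r"(?:at|outside)\s\w+")

broken_matches = broken.findall(sentence)
fixed_matches = fixed.findall(sentence)

print(broken_matches)  # ['at', 'outside gate']
print(fixed_matches)   # ['at building', 'outside gate']

# The bare "at" branch also fires inside other words.
inside_word = broken.findall("gate")
print(inside_word)     # ['at']
```

The broken pattern appears to work because the outside branch does capture the following word; only the at branch is cut short.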

If you put the keywords into a group, captures = match.capturesdict() will work well (see the output above).

Upvotes: 1
