John
John

Reputation: 2952

Python regular expressions with overlap

Given a string like:

text = 'jack betty "donald jake rita" lorie katie danny'

and a list of keywords like:

keywords = ["jack", "donald", "rita", "katie"]

how do I only return the keywords that were found within the double quotes?

Expected return:

["donald", "rita"]

I got as far as:

import re
import string

keywords = ["jack", "donald", "rita", "katie"]
text = 'jack betty "donald jake rita" lorie katie danny'

pattern_template = string.Template(r'(?:^[^"]*["][^"]*)($keywords)')
search_pattern = pattern_template.safe_substitute(keywords="|".join(keywords))

# This gives me a search_pattern of: (?:^[^"]*["][^"]*)(jack|donald|rita|katie)

matches = [match.group(1) for match in re.finditer(search_pattern, text)]

print(matches)

My search pattern starts with the beginning of the string, skips any non-quote characters, matches a quote, skips 0 or more additional non-quote characters, and looks for one of the keywords.

This returns only:

['rita']

If I get rid of "rita" from the keywords, it only returns:

['donald']

Above, I'm only checking the first capturing group with match.group(1), but if I check .groups() to get a tuple of all capturing groups, I still only see [('rita',)]

I think the problem is overlapping matches. Is there a way to return all overlapping matches?

Upvotes: 0

Views: 386

Answers (4)

Cary Swoveland
Cary Swoveland

Reputation: 110725

Here are two ways that can be done in one pass.

Match what you don't want, capture what you do want

With this approach by matching the following regular expression we obtain the words of interest in capture group 1, paying no attention to the matches that do not result in a capture.

(?:^|\").*?(?:$|\")|(\b(?:jake|donald|rita|katie)\b)

This expression must of course be constructed programmatically from the array keywords, which is not difficult.

Demo

This makes use of the fact that well-formed strings contain an even number of double-quotes.

The string can be broken down as follows.

(?:^|\")  # match the beginning of the string or a double-quote
.*?       # match zero or more characters lazily
(?:$|\")  # match the end of the string or a double-quote
|         # or
(         # begin capture group 1
  \b      # match word boundary
  (?:jake|donald|rita|katie) # match one of the four words 
  \b      # match word boundary
)         # end capture group 1

The word boundaries are to avoid matching words such as 'heritage' (which contains the name of a lovely meter maid).


Count trailing double-quotes to determine if a word is within a quoted string

Observe that a word is within a doubly-quoted string if and only if the part of the string that follows the word contains an odd number of double quotes. Suppose, for example, the string were:

'jack betty "donald jake rita" lorie katie danny "katie"'

             ^^^^^^ ^^^^ ^^^^                     ^^^^^

"donald", "jake" and "rita" are each followed by three double-quotes, so we infer they are within a doubly-quoted string. Same for "katie" which followed by one double-quote. All the remaining words are followed by a an even number of double-quotes so we infer they are not within a double-quoted string. Naturally, this is assuming that the entire string is well-formed in the sense that it contains an even number of double-quotes.

You can use re.findall with the following regular expression to identify the words of interest.

\b(?:jake|donald|rita|katie)\b(?=[^\"]*(?:(?:\"[^\"]*){2})*\"[^\"]*$)

Demo.

The regex can be broken down as follows.

\b             # match a word boundary
(?:jake|donald|rita|katie) # match one of four words in a non-capture group
\b             # match a word boundary
(?=            # begin a positive lookhead 
  [^\"]*       # match zero or more chars other than dq's (double-quotes)
  (?:          # begin a non-capture group                     
    (?:        # begin a non-capture group
      \"[^\"]* # match a dq followed by zero or more chars other than a dq
    ){2}       # end non-capture group and execute twice
  )*           # end non-capture group and execute zero or more times
  \"[^\"]*     # match a dq followed by zero or more chars other than a dq 
  $            # match end of string
)              # end positive lookahead

Use of this expression may require considerable backtracking by the regex engine. As shown at the link, the regex engine required 261 steps to parse the string given above.

Upvotes: 2

The fourth bird
The fourth bird

Reputation: 163457

Assuming the double quotes are paired up for the whole text, you can get all the matches between double quotes in a capture group.

Then split the match on 1 or more whitespace chars, and get the intersection of the 2 collections.

The pattern also matches escaped double quotes in between the double quotes.

import re

pattern = r'"([^"\\]*(\\.[^"\\]*)*)"'
text = 'jack betty "donald jake rita" lorie katie danny, "donald and brad and katie"'
keywords = ["jack", "donald", "rita", "katie"]

for matchNum, match in enumerate(re.finditer(pattern, text), start=1):
    print(list(set(keywords) & set(re.split(r"\s+", match.group(1))))) 

Output

['donald', 'rita']
['katie', 'donald']

See a Python demo and a regex demo.

Upvotes: 2

Dean Taylor
Dean Taylor

Reputation: 42021

Note: I don't know python but this would be my attempt

Code

import re
import string

keywords = ["jack", "donald", "rita", "katie"]
text = 'jack betty "donald jake rita" lorie katie danny'

# keep only the keywords in quotes
result = re.sub('[^"]*"([^"]+)"[^"]*', r"\1 ", text)

# split the unquoted word list into separate keywords
result = re.split(r"\s+", result, 0)

# Get intersection of two keyword lists
result = list(set(keywords) & set(result))

print(result)

Output

['donald', 'rita']

Upvotes: 2

MikeM
MikeM

Reputation: 13641

You can't get multiple matches if each match must begin at a particular index of the string - in your case the start ^ - because the same match would be found each time. Subsequent matches can only be found from a position beginning after the end of the previous match.

Here's one way of doing it using a positive look-ahead, which, being a zero-width assertion, does not consume the characters it matches so the next match does not have to start from the end of what it matches:

import re
import string

keywords = ["jack", "donald", "rita", "katie"]
text = 'jack betty "donald jake rita" lorie katie danny'

pattern_template = string.Template(r'\b($keywords)\b(?=[^"]*"[^"]*$)')

search_pattern = pattern_template.safe_substitute(keywords="|".join(keywords))

matches = [match.group(1) for match in re.finditer(search_pattern, text)]

print(search_pattern)
print()
print(matches)

If the text can have multiple quoted sections then an alternative positive look-ahead like

(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)

may be good enough. It asserts that there are an odd number of " ahead in the string.

We have to look forward because Python does not allow variable-length look-behinds.

Upvotes: 0

Related Questions