Reputation: 139
I have a regex job that searches for a pattern of the form
(This) <some words> (is a/was a) <some words> (1-4 digit number) <some words> (word)
where <some words>
can be any number of words/characters, including spaces.
I used the following to achieve this.
(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})((?=\W).*?(?<=\W))*(word)(?=\W)
I also have another constraint: the total length of the match should be less than 30 characters. Currently, my search matches at any length and finds all sets of words. Is there an option in regex that lets me enforce this constraint in the regex string itself?
I am currently doing this by checking the length of the matched regex objects. That means I have to deal with matches that are longer than the required length, and this causes me to miss some detections that are within the length constraint.
For example, the string
"hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish."
has 2 matches: a long one starting at "This" and a shorter one, "this is a 12 word", nested inside it.
My search captures the first one and misses the second, even though the second one meets my length criterion.
If the first match is within the length constraint, then I can ignore the second match.
I am using re.sub() to replace those strings, with a repl function passed to sub() that checks the length. My dataset is large, so the search takes a lot of time. What matters most to me is doing the search efficiently with the length constraint built in, so that these incorrect matches are avoided.
I am using Python 3.
Thanks in advance.
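The behaviour described above can be reproduced with a minimal sketch like the one below. The original code was not posted, so the re.IGNORECASE flag (needed for the lowercase "this" to count as a candidate at all), the MAX_LEN constant, the '<MATCH>' replacement text, and the replace_if_short helper are all assumptions made for illustration.

```python
import re

# The pattern from the question; IGNORECASE is an assumption.
QUESTION_PAT = re.compile(
    r'(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})'
    r'((?=\W).*?(?<=\W))*(word)(?=\W)',
    re.IGNORECASE,
)
MAX_LEN = 30  # matches must be shorter than this


def replace_if_short(m):
    """Hypothetical repl function: replace only matches under MAX_LEN."""
    text = m.group(0)
    return '<MATCH>' if len(text) < MAX_LEN else text


sample = ('hi This is a alpha, bravo Charley, delta, echo, fox, golf, '
          'this is a 12 word finish.')

matches = [m.group(0) for m in QUESTION_PAT.finditer(sample)]
result = QUESTION_PAT.sub(replace_if_short, sample)

print(len(matches))      # number of non-overlapping matches found
print(result == sample)  # True means nothing was replaced
```

Because matches found by sub() and finditer() never overlap, the long match starting at "This" consumes the text that contains the short candidate, so the short one is never offered to the repl function.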
Upvotes: 1
Views: 109
Reputation: 488
The regex engine doesn't provide a method to do exactly what you're asking for; you'd need to use regex in conjunction with another tool to get the result you want.
Building on some of the comments on your question, the following regex will return the entire match (everything from 'This' through 'word'):
\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b
You can then filter the results to only produce the output you're looking for.
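import re

string = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
# The capturing group sits inside a lookahead, so the match itself is zero-width
# and candidates that overlap each other are all reported.
pat = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')
# Keep only the candidates shorter than 30 characters.
# returns ['this is a 12 word']
[x[1] for x in pat.finditer(string) if len(x[1]) < 30]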
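Since the question is ultimately about replacing the matched text, the same filtered candidates can be used to rebuild the string from their spans. This is only a sketch: the replace_short_matches helper, the MAX_LEN constant, and the '<MATCH>' replacement text are hypothetical names chosen for illustration, and skipping a short candidate that overlaps one already replaced is an assumed policy.

```python
import re

MAX_LEN = 30  # matches must be shorter than this (from the question)
PAT = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')


def replace_short_matches(text, replacement='<MATCH>'):
    """Replace every candidate shorter than MAX_LEN, skipping overlaps."""
    pieces = []
    last_end = 0  # end of the last replaced span
    for m in PAT.finditer(text):
        start, end = m.span(1)  # span of the group captured in the lookahead
        if end - start >= MAX_LEN or start < last_end:
            continue  # too long, or overlaps a span already replaced
        pieces.append(text[last_end:start])
        pieces.append(replacement)
        last_end = end
    pieces.append(text[last_end:])
    return ''.join(pieces)


sample = ('hi This is a alpha, bravo Charley, delta, echo, fox, golf, '
          'this is a 12 word finish.')
print(replace_short_matches(sample))
# prints: hi This is a alpha, bravo Charley, delta, echo, fox, golf, <MATCH> finish.
```

The string is scanned once and joined at the end, which avoids repeated slicing on a large dataset. Note that `.` does not cross newlines unless re.DOTALL is added.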
Upvotes: 1