Reputation: 33
I am trying to get the shortest match of the pattern '''.*?''' is a [[.*?]]
for sentences such as
'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]
which contains the target string
'''starter culture''' is a [[microbiological culture]]
The idea is to get the latter string. To do so, I am using the following Python code:
import re
regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")
re.findall(regex, line)
However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also solve it using
re.findall(regex, line[30:])
to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.
Upvotes: 3
Views: 490
Reputation: 12590
If you're sure that there will be no '[' between the ''' delimiters, a simple solution is this:
regex = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
regex.findall(line)
Or you could do the same thing by excluding ' instead:
regex = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")
regex.findall(line)
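For what it's worth, here is a quick check of both patterns against the sample sentence from the question. This is only a sketch: the names `line`, `no_bracket` and `no_quote` are mine, and the sample string is copied from the question as a single line.

```python
import re

# Sample sentence from the question, joined into one line.
line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Variant 1: the text between the ''' delimiters may not contain '['.
no_bracket = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")

# Variant 2: the text between the ''' delimiters may not contain "'".
no_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

print(no_bracket.findall(line))
print(no_quote.findall(line))
# Both print:
# ["'''starter culture''' is a [[microbiological culture]]"]
```

Both variants work because the negated character class stops the first group from running past the point where the match should have ended, so the engine has to give up on the first ''' and try again further along the string.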
Upvotes: 0
Reputation: 785186
You can use this lookahead based regex:
>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]
(?:(?!''').)* matches zero or more characters, none of which is the start of a ''' sequence, which ensures the match cannot run past the closing ''' and so covers the shortest span between two ''' delimiters.
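To see why the original pattern fails, note that a non-greedy quantifier only controls where a match ends, not where it starts: the engine still takes the leftmost position from which a match is possible, and from the first ''' the lazy .*? can expand across the whole sentence. A minimal sketch comparing the two patterns follows; the names `lazy`, `tempered` and the `apostrophe` string are my own illustrations, not from the question.

```python
import re

# Sample sentence from the question, joined into one line.
line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Original pattern: lazy, but still anchored at the leftmost ''' it can use.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Lookahead version: each consumed character must not begin a ''' sequence.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

print(lazy.findall(line) == [line])  # True: the whole sentence is matched
print(tempered.findall(line))
# ["'''starter culture''' is a [[microbiological culture]]"]

# Hypothetical input with a single apostrophe inside the bold text.
apostrophe = "'''baker's yeast''' is a [[fungus]]"
print(tempered.findall(apostrophe))
# ["'''baker's yeast''' is a [[fungus]]"]
```

The last case shows one reason to prefer the lookahead over a negated class such as [^']: a lone apostrophe inside the delimiters is still allowed, since only a full ''' sequence stops the repetition.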
Upvotes: 2