How to get the shortest match with Python (complex non-greedy pattern)

Question

I am trying to get the shortest match of the pattern '''.*?''' is a [[.*?]] in sentences such as

'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]

which contains the target string

 '''starter culture''' is a [[microbiological culture]]

The idea is to get the latter string. To do so, I am using the following Python code:

import re
regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")
re.findall(regex, line)

However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also work around the problem using

re.findall(regex, line[30:])

to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.

anubhava · Accepted Answer

You can use this lookahead based regex:

>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]

(?:(?!''').)* matches zero or more characters, checking before each one that ''' does not begin at that position. The match therefore cannot run across an intervening ''', which guarantees the shortest match between two ''' delimiters.
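
To see the difference side by side, here is a minimal sketch that runs both patterns on the sample sentence from the question. The variable names (lazy, tempered, lazy_matches, tempered_matches) are only illustrative and do not appear in the original post.

```python
import re

# Sample sentence from the question.
line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Pattern from the question: .*? is lazy, but the engine still anchors at the
# FIRST ''' and lets .*? grow (across inner ''') until the rest can match.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Pattern from the answer: each character is consumed only if ''' does not
# start there, so the span between the two ''' cannot contain another '''.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

lazy_matches = lazy.findall(line)
tempered_matches = tempered.findall(line)

print(lazy_matches)      # one match covering the whole sentence
print(tempered_matches)  # only the '''starter culture''' ... part
```

Non-greedy quantifiers shorten a match only from the right. The leftmost starting position still wins, which is why the lazy pattern returns the whole sentence here.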

