J G Moreno

Reputation: 33

How to get the shortest match in Python (complex non-greedy pattern)

I am trying to get the shortest match of the pattern '''.*?''' is a [[.*?]] in sentences such as

'''fermentation starter''' is a preparation to assist the beginning of the [[fermentation (biochemistry)|fermentation]]. A '''starter culture''' is a [[microbiological culture]]

which contains the target string

 '''starter culture''' is a [[microbiological culture]]

The idea is to get the latter string. To do so, I am using the following Python code:

import re
regex = re.compile(r"'''.*?''' is a \[\[.*?\]\]")
re.findall(regex, line)

However, I am getting the full sentence instead of the shortest match. Note that I have added '?' after the quantifier to make the match non-greedy. I can also work around it using

re.findall(regex, line[30:])

in order to skip the first occurrence of '''.*?''', but I am looking for a more natural solution.

Upvotes: 3

Views: 490

Answers (2)

Etienne

Reputation: 12590

If you're sure that '[' will never appear between the ''' delimiters, a simple solution is this:

regex = re.compile(r"'''[^[]*?''' is a \[\[.*?\]\]")
regex.findall(line)

Or you could do the same thing but exclude ' instead, provided the text between the delimiters never contains an apostrophe:

regex = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")
regex.findall(line)
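
To make the difference concrete, here is a minimal sketch that runs the question's lazy pattern next to the two negated-class patterns above on the sample sentence. The variable names are only illustrative, and the first character class is written as [^\[] (equivalent to [^[]) purely for readability.

```python
import re

# Sample sentence from the question.
line = ("'''fermentation starter''' is a preparation to assist the beginning "
        "of the [[fermentation (biochemistry)|fermentation]]. "
        "A '''starter culture''' is a [[microbiological culture]]")

# Original lazy pattern: the match starts at the FIRST ''' and .*? simply
# expands until the rest of the pattern can succeed.
lazy = re.compile(r"'''.*?''' is a \[\[.*?\]\]")

# Variant 1: the quoted part may not contain '['.
no_bracket = re.compile(r"'''[^\[]*?''' is a \[\[.*?\]\]")

# Variant 2: the quoted part may not contain "'".
no_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

lazy_matches = lazy.findall(line)
bracket_matches = no_bracket.findall(line)
quote_matches = no_quote.findall(line)

print(lazy_matches == [line])  # True: the lazy pattern matches the whole line
print(bracket_matches)
print(quote_matches)
```

Both variants work because the excluded character stops the first quoted part from running across other markup, so the attempt starting at the first ''' fails and the search moves on to the later one.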

Upvotes: 0

anubhava

Reputation: 785186

You can use this lookahead-based regex:

>>> print(re.findall(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]", line))
["'''starter culture''' is a [[microbiological culture]]"]

(?:(?!''').)* matches zero or more characters, checking before each one that ''' does not start at that position. The quoted part therefore cannot run past a ''' delimiter, which gives the shortest match between two ''' markers.
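
As a rough illustration of where the lookahead version helps, the sketch below compares it with a negated-class pattern that excludes ' on a made-up sentence (a hypothetical example, not from the question) whose quoted name contains a single apostrophe. The names tempered, no_quote and sample are assumptions for this sketch.

```python
import re

# Lookahead-based pattern: each character is consumed only if ''' does not
# start at that position, so a single ' inside the name is still allowed.
tempered = re.compile(r"'''(?:(?!''').)*''' is a \[\[.*?\]\]")

# Negated-class pattern for comparison: rejects any ' inside the name.
no_quote = re.compile(r"'''[^']*''' is a \[\[.*?\]\]")

# Hypothetical input, invented for this example.
sample = "'''yeast''' is a fungus. A '''baker's yeast''' is a [[fungus]]"

tempered_matches = tempered.findall(sample)
no_quote_matches = no_quote.findall(sample)

print(tempered_matches)   # finds the second quoted name
print(no_quote_matches)   # empty: the apostrophe blocks the match
```

Note that . does not match a newline by default, so the quoted part is assumed to sit on one line unless re.DOTALL is passed.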


Upvotes: 2
