Reputation: 14096
I would like to match the last occurrence of a pattern using regex.
I have some text structured this way:
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>
I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, ideally just the Tizi Ouzou string.
Note that there is some whitespace after the last <br>.
I've tried this:
<br>.*<br>\s*$
but it selects everything from the first <br> to the last.
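A minimal sketch reproducing the behaviour described above, assuming a shortened version of the sample text (the variable names are illustrative). A regex search reports the leftmost match, so the match begins at the first <br>; switching to a lazy .*? does not appear to help, because the start position stays the same.

```python
import re

# Shortened sample (assumption): only the tail of the original text.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# Greedy version from the question.
greedy = re.search(r"<br>.*<br>\s*$", text)

# Lazy version: still starts at the first <br>, since the leftmost match wins.
lazy = re.search(r"<br>.*?<br>\s*$", text)

print(repr(greedy.group(0)))
print(greedy.group(0) == lazy.group(0))
```

Both patterns end up spanning from the first <br> to the end of the string, which is why the answers below restrict what may appear between the two tags.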
NB: I'm using Python, and I'm testing my regex with pythex.
Upvotes: 7
Views: 15684
Reputation: 10278
For me the clearest way is:
>>> re.findall('<br>(.*?)<br>', text)[-1]
'Tizi Ouzou'
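One caveat worth checking: findall returns non-overlapping matches, and each match consumes its closing <br>, so the segments are taken in pairs. A small sketch with a made-up string (an assumption, not the original text) shows that the last segment can be skipped, and that a lookahead for the closing tag seems to avoid this:

```python
import re

# Hypothetical input with two segments between three <br> tags.
even = "<br>a<br>b<br>"

# The closing <br> of the first match is consumed, so 'b' is never seen.
consumed = re.findall(r"<br>(.*?)<br>", even)

# A lookahead checks for the closing <br> without consuming it.
overlapping = re.findall(r"<br>(.*?)(?=<br>)", even)

print(consumed)
print(overlapping)
```

With the question's text both variants return 'Tizi Ouzou' as the last element, so the difference only shows up for other inputs.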
Upvotes: 15
Reputation: 89557
You can use a greedy quantifier with a restricted character class (assuming you have no tags between your <br> tags):
<br>([^<]*)<br>\s*$
or
<br>((?:[^<]+|<(?!br>))*)<br>\s*$
to allow tags inside.
Since the string you are looking for is Tizi Ouzou, without the <br> tags, you can extract the first capturing group.
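As a rough sketch of how the two patterns above might be used from Python with re.search (the shortened sample and the string containing a <b> tag are assumptions made up for illustration):

```python
import re

# Shortened sample (assumption): only the tail of the original text.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# First pattern: no tags allowed between the two <br>.
no_tags = re.search(r"<br>([^<]*)<br>\s*$", text)
print(no_tags.group(1))

# Second pattern: other tags are allowed, but not another <br>.
tagged = "intro <br>first<br>Tizi <b>Ouzou</b><br>  "  # hypothetical input
with_tags = re.search(r"<br>((?:[^<]+|<(?!br>))*)<br>\s*$", tagged)
print(with_tags.group(1))
```

In both cases group(1) holds the text without the surrounding <br> tags.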
Upvotes: 6
Reputation: 36262
Try:
re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)
It first consumes all data up to the last <br>, then backtracks until the look-ahead confirms that there is another <br> after it, and then extracts the content between them.
It yields:
Tizi Ouzou
EDIT: No need for the look-ahead. An alternative (with the same result), based on a comment by m.buettner:
re.match(r'(?s).*<br>(.*)<br>', s).group(1)
Upvotes: 3
Reputation: 44259
Have a look at the related questions: you shouldn't parse HTML with regex. Use an HTML parser instead. For Python, I hear Beautiful Soup is the way to go.
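To illustrate the parser route without extra dependencies, here is a minimal sketch that uses the standard library's html.parser instead of Beautiful Soup (a swapped-in choice, not what the answer names). The class name BrSegments and the shortened sample are assumptions.

```python
from html.parser import HTMLParser


class BrSegments(HTMLParser):
    """Hypothetical helper: collect the text between consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = None  # stays None until the first <br> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A second (or later) <br> closes the segment collected so far.
            if self._buffer is not None:
                self.segments.append("".join(self._buffer))
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)


parser = BrSegments()
parser.feed("egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   ")
parser.close()
print(parser.segments[-1])
```

The trailing whitespace after the final <br> is never followed by another <br>, so it is not recorded as a segment.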
Anyway, if you want to do it with regex, you need to make sure that .* cannot go past another <br>. To do that, before consuming each character we can use a lookahead to make sure that it doesn't start another <br>:
<br>(?:(?!<br>).)*<br>\s*$
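A short sketch of the pattern above in Python, with a capturing group added around the repeated part so the inner text can be read directly. The capturing group, the re.DOTALL flag (so a segment could span lines) and the sample strings are assumptions added for illustration.

```python
import re

# Each character is consumed only if it does not start another <br>.
pattern = re.compile(r"<br>((?:(?!<br>).)*)<br>\s*$", re.DOTALL)

# Shortened sample (assumption): only the tail of the original text.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
match = pattern.search(text)
print(match.group(1))

# Other tags inside the last segment are allowed (hypothetical input).
nested = pattern.search("x<br>a<br>b <i>c</i><br>")
print(nested.group(1))
```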
Upvotes: 8
Reputation: 473863
How about [^<>]* instead of .*:
import re
text = """Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """
# Raw string so \s is not treated as a string escape; print() works on Python 3.
print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))
prints
Tizi Ouzou
Upvotes: 4
Reputation: 142156
A non-regex approach using the built-in str methods:
text = """
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """
# Split at most twice from the right: [before, last segment, trailing text]
res = text.rsplit('<br>', 2)[-2]
# res == 'Tizi Ouzou'
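Note that with fewer than two <br> tags the one-liner either raises IndexError (no tag) or quietly returns the text before the single tag. A small guarded sketch, where the helper name last_between is hypothetical:

```python
def last_between(text, sep="<br>"):
    """Return the text between the last two separators, or None if there are fewer than two."""
    parts = text.rsplit(sep, 2)
    # Exactly three parts means at least two separators were found.
    return parts[1] if len(parts) == 3 else None


print(last_between("egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "))
print(last_between("only one<br>"))
```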
Upvotes: 13