Reputation: 14096

Match last occurrence with regex

I would like to match the last occurrence of a pattern using regex.

I have some text structured this way:

Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, ideally just the Tizi Ouzou string.

Note that there is some whitespace after the last <br>.

I've tried this:

<br>.*<br>\s*$

but it selects everything from the first <br> to the last.

NB: I'm on Python, and I'm using pythex to test my regex.

Upvotes: 7

Answers (6)

moliware

Reputation: 10278

For me the clearest way is:

>>> re.findall('<br>(.*?)<br>', text)[-1]
'Tizi Ouzou'
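
One caveat worth noting: each match consumes its closing <br>, so findall skips every other segment, and the last element is only the last segment when the tags happen to line up. A minimal sketch, assuming a shortened copy of the question's text, that uses a lookahead for the closing tag so every segment is returned:

```python
import re

# Shortened sample based on the text in the question.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# Each match consumes its closing <br>, so alternate segments are skipped.
consuming = re.findall(r'<br>(.*?)<br>', text)

# A lookahead leaves the closing <br> in place, so it can also serve
# as the opening tag of the next match.
overlapping = re.findall(r'<br>(.*?)(?=<br>)', text)

print(consuming)        # ['semper', 'Tizi Ouzou']
print(overlapping)      # ['semper', 'tizi ouzou', 'Tizi Ouzou']
print(overlapping[-1])  # Tizi Ouzou
```

With the lookahead variant, [-1] is the last segment regardless of how many <br> tags precede it.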

Upvotes: 15

Casimir et Hippolyte

Reputation: 89557

You can use a greedy quantifier with a reduced character class (assuming you have no tags between your <br> tags):

<br>([^<]*)<br>\s*$

or, to allow tags inside:

<br>((?:[^<]+|<(?!br>))*)<br>\s*$

Since the string you are searching for is Tizi Ouzou without the <br> tags, you can extract the first capturing group.
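
As a rough sketch of how the two patterns behave in Python (the sample strings are shortened or invented for illustration, and the variable names are arbitrary):

```python
import re

# Shortened sample based on the text in the question.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# First pattern: no tags at all between the last two <br>.
simple = re.search(r'<br>([^<]*)<br>\s*$', text)

# Second pattern: other tags are allowed inside, as long as they are
# not <br>. This sample string is hypothetical.
tagged = "x<br>semper<br>Tizi <b>Ouzou</b><br>  "
tolerant = re.search(r'<br>((?:[^<]+|<(?!br>))*)<br>\s*$', tagged)

print(simple.group(1))    # Tizi Ouzou
print(tolerant.group(1))  # Tizi <b>Ouzou</b>
```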

Upvotes: 6

Birei

Reputation: 36262

Try:

re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)

The greedy .* first consumes all data up to the last <br>, then backtracks until the look-ahead confirms that there is another <br> after the one just matched, and finally extracts the content between them.

It yields:

Tizi Ouzou

EDIT: No need for the look-ahead. Alternative (with the same result) based on a comment by m.buettner:

re.match(r'(?s).*<br>(.*)<br>', s).group(1)
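
The (?s) flag matters here because re.match is anchored at the start of the string and the text spans several lines. A small sketch, assuming a two-line sample string made up for illustration:

```python
import re

# Hypothetical two-line sample; the <br> tags sit on the second line.
s = "first line\nsecond line <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>  "

# Without (?s), '.' stops at the newline, so a match anchored at the
# start of the string can never reach the <br> tags.
without_flag = re.match(r'.*<br>(.*)<br>', s)

# With (?s), the leading greedy .* runs to the end of the string and
# backtracks to the last position where two <br> can still follow.
with_flag = re.match(r'(?s).*<br>(.*)<br>', s)

print(without_flag)        # None
print(with_flag.group(1))  # Tizi Ouzou
```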

Upvotes: 3

Martin Ender

Reputation: 44259

Have a look at the related questions: you shouldn't parse HTML with regex. Use an HTML parser instead. For Python, I hear Beautiful Soup is the way to go.

Anyway, if you want to do it with regex, you need to make sure that .* cannot go past another <br>. To do that, before consuming each character we can use a lookahead to make sure that it doesn't start another <br>:

<br>(?:(?!<br>).)*<br>\s*$
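
A minimal sketch of using this pattern from Python. The capturing group and the re.DOTALL flag are additions not in the pattern above: the group extracts just the text, and the flag is an assumption that the content may span lines.

```python
import re

# Shortened sample based on the text in the question.
text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# (?:(?!<br>).)* consumes one character at a time, but only while the
# upcoming text does not start another <br>.
pattern = re.compile(r'<br>((?:(?!<br>).)*)<br>\s*$', re.DOTALL)

match = pattern.search(text)
print(match.group(1))  # Tizi Ouzou
```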

Upvotes: 8

alecxe

Reputation: 473863

How about [^<>]* instead of .*:

import re


text = """Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """


# Raw string avoids escape warnings; print() also works on Python 3.
print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

Tizi Ouzou

Upvotes: 4

Jon Clements

Reputation: 142156

A non-regex approach using the built-in str methods:

text = """
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

res = text.rsplit('<br>', 2)[-2]
#Tizi Ouzou
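
To see why index -2 is the right one, here is a short sketch (with a shortened sample string) of what rsplit returns when limited to two splits from the right:

```python
# Shortened sample based on the text in the question.
s = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# Splitting at most twice from the right yields three parts: everything
# before the second-to-last <br>, the text between the last two <br>,
# and the trailing whitespace after the last <br>.
parts = s.rsplit('<br>', 2)

print(parts)      # ['egestas <br>semper<br>tizi ouzou', 'Tizi Ouzou', '   ']
print(parts[-2])  # Tizi Ouzou
```

This assumes the text contains at least two <br> tags; with fewer, the list is shorter and [-2] no longer refers to the text between them.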

Upvotes: 13
