Reputation: 95

Python regex, matching too much

Hi, I have this regular expression:
<a href="(.+?)" class="nextpostslink">

This regex works fine on the following HTML:
'> <span class='pages'>Page 1 of 12</span><span class='current'>1</span><a href='http://cinemassacre.com/category/avgn/page/2/' class='page larger'>2</a><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">»</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last »</a> </div> </div>

The part I am trying to extract is the next-page URL from
<a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">

But when I run this regex on this block of HTML
'> <span class='pages'>Page 2 of 12</span><a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a><a href='http://cinemassacre.com/category/avgn/' class='page smaller'>1</a><span class='current'>2</span><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last »</a> </div> </div>

It extracts everything from the first <a href=" to " class="nextpostslink">
Why does this happen? I thought (.+?) was non-greedy, so it should match the minimal amount.
The minimal match would be <a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">

The complete Python code I'm using is:
match=re.compile('<a href="(.+?)" class="nextpostslink">', re.DOTALL).findall(pagenav)

Upvotes: 2

Answers (3)

jdotjdot

Reputation: 17092

As I understand it, the greediness works from the beginning of the regex--i.e., it finds <a href=", and then the non-greediness has it stop at the first " class="nextpostslink"> instead of the last one, like the greedy version would do.

You're best off using BeautifulSoup here:

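from bs4 import BeautifulSoup as BS

# html is the page source from the question
soup = BS(html, "html.parser")
# Find the first <a> tag whose class is "nextpostslink" and read its href
print(soup.find("a", "nextpostslink")["href"])
# prints http://cinemassacre.com/category/avgn/page/2/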

Upvotes: 3

NPE

Reputation: 500933

It extracts everything from the first <a href=" to " class="nextpostslink"> Why does this happen? I thought (.+?) was non greedy, so it should extract the minimal amount

It is non-greedy. However, the engine starts at the first <a href=" it finds, and because " class="nextpostslink"> is a mandatory part of the pattern, .+? has to keep consuming characters until that text appears.

Upvotes: 1

Martin Ender

Reputation: 44289

The start of your match is always greedy in a sense. That is because the engine attempts matches from left to right in your subject string. The first <a href=" is encountered, which is fine, and then the engine goes ahead and consumes everything with .+? until the match is completed (it stops as soon as possible, due to the .+?). But it does not try to start the match as far right as possible, because the match is just fine. Hence, you could say using ? makes the end of the match ungreedy (taking the first possible end of the match), but the start of the match will always be greedy (the match will always begin at the leftmost possible position, no matter how you try to make it ungreedy).
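
The leftmost-start behaviour can be seen in a small sketch. The pagenav string below is a shortened stand-in for the page-2 HTML in the question (an assumption made for brevity, not the full markup):

```python
import re

# Shortened stand-in for the page-2 navigation HTML from the question.
pagenav = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a>'
)

lazy = re.compile(r'<a href="(.+?)" class="nextpostslink">', re.DOTALL)
m = lazy.search(pagenav)

# The match begins at the very first <a href=" in the string (offset 0).
print(m.start())

# The lazy group then grows until the first place where the rest of the
# pattern fits, so it swallows the earlier tags along the way.
print(m.group(1))
```

The captured group begins with the "previous" link's URL and only ends at the "next" link's URL, which is the over-long result described in the question.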

This is why there is often a better alternative to ungreedy repetition: exclude the delimiter from the repetition:

<a href="([^"]*)" class="nextpostslink">

This can never go past the closing ", so there is no need to worry that anything outside of the attribute or tag will be part of the match.
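
As a rough check of the pattern above, here it is applied to a shortened stand-in for the page-2 HTML (the pagenav string is an assumption made for brevity):

```python
import re

# Shortened stand-in for the page-2 navigation HTML from the question.
pagenav = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a>'
)

# [^"]* cannot cross a double quote, so an attempt that starts at the
# "previous" link fails at its class attribute and the engine moves on.
strict = re.compile(r'<a href="([^"]*)" class="nextpostslink">')
print(strict.findall(pagenav))
# ['http://cinemassacre.com/category/avgn/page/3/']
```

re.DOTALL is no longer needed here, because the pattern contains no dot.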

Let me add anyway, that you should not use regular expressions to parse HTML. What if ' is used instead of " (as in your second anchor tag in the given example)? What if there are multiple spaces between your attributes? What if there are more attributes than just href and class? What if the class attribute is listed before the href attribute?

jdotjdot's answer has a good example of how to do it the right way in Python.
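
If installing BeautifulSoup is not an option, a similar idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the approach from jdotjdot's answer; the NextLinkFinder class and the sample tag are hypothetical and written to exercise the "what if" cases above (single quotes, extra whitespace, an extra attribute, class before href):

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose class list contains 'nextpostslink'."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        # Only look at <a> tags, and stop once a link has been found.
        if tag != "a" or self.next_url is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "nextpostslink" in classes:
            self.next_url = attr_map.get("href")


# Hypothetical variant of the question's tag, not taken from the real page.
html = "<a class='nextpostslink'   title='next'  href='http://cinemassacre.com/category/avgn/page/3/'>»</a>"

finder = NextLinkFinder()
finder.feed(html)
print(finder.next_url)
```

Because the parser hands over attributes as name/value pairs, the quote style, spacing, and attribute order no longer matter.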

Upvotes: 3
