Kr0nZ

Reputation: 95

Python regex, matching too much

Hi, I have a regular expression:
<a href="(.+?)" class="nextpostslink">

This regex works fine on the following HTML:
'> <span class='pages'>Page 1 of 12</span><span class='current'>1</span><a href='http://cinemassacre.com/category/avgn/page/2/' class='page larger'>2</a><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">&raquo;</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last &raquo;</a> </div> </div>

The part I am trying to extract is the next-page URL from:
<a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">

But when I run this regex on this block of HTML:
'> <span class='pages'>Page 2 of 12</span><a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">&laquo;</a><a href='http://cinemassacre.com/category/avgn/' class='page smaller'>1</a><span class='current'>2</span><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">&raquo;</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last &raquo;</a> </div>
</div>


It extracts everything from the first <a href=" up to " class="nextpostslink">.
Why does this happen? I thought (.+?) was non-greedy, so it should extract the minimal amount.
The minimal match should be <a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">

The complete Python code I'm using is:
match=re.compile('<a href="(.+?)" class="nextpostslink">', re.DOTALL).findall(pagenav)

Upvotes: 2

Views: 727

Answers (3)

jdotjdot

Reputation: 17062

As I understand it, matching works from the beginning of the regex: it finds the first <a href=", and then the non-greedy quantifier makes it stop at the first " class="nextpostslink"> rather than the last one, as the greedy version would.

You're best off using BeautifulSoup here:

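from bs4 import BeautifulSoup as BS

# html is assumed to hold the page source as a string
soup = BS(html, "html.parser")
# find the first <a> tag whose class is "nextpostslink" and read its href
print(soup.find("a", "nextpostslink")["href"])
# for the first HTML block in the question: http://cinemassacre.com/category/avgn/page/2/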

Upvotes: 3

NPE

Reputation: 500475

It extracts everything from the first <a href=" to " class="nextpostslink">. Why does this happen? I thought (.+?) was non greedy, so it should extract the minimal amount

It is non-greedy. However, the mandatory " class="nextpostslink"> that follows the group forces the engine, once it has started at the first <a href=", to keep matching until it finds " class="nextpostslink">.

Upvotes: 1

Martin Ender

Reputation: 44259

The start of your match is always greedy in a sense. That is because the engine attempts matches from left to right in your subject string. The first <a href=" is encountered, which is fine, and then the engine goes ahead and consumes everything with .+? until the match is completed (it stops as soon as possible, due to the .+?). But it does not try to start the match as far right as possible, because the match is just fine. Hence, you could say using ? makes the end of the match ungreedy (taking the first possible end of the match), but the start of the match will always be greedy (the match will always begin at the leftmost possible position, no matter how you try to make it ungreedy).
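
A minimal sketch of this behaviour, using a shortened stand-in for the question's pagination HTML (the html string and the example.com URLs are made up for illustration):

```python
import re

# Shortened, hypothetical stand-in for the pagination HTML in the question.
html = (
    '<a href="http://example.com/" class="previouspostslink">prev</a>'
    '<a href="http://example.com/page/3/" class="nextpostslink">next</a>'
)

lazy = re.compile(r'<a href="(.+?)" class="nextpostslink">', re.DOTALL)
m = lazy.search(html)

# The match begins at the leftmost <a href=" in the string (index 0),
# not at the anchor that actually carries class="nextpostslink".
print(m.start())

# So the lazy group still runs across the first tag into the second one.
print(m.group(1))
```

The group ends as early as it can, but because the match started at the first anchor, "as early as it can" is still the end of the second anchor's href.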

This is why there is often a better alternative to ungreedy repetition: exclude the delimiter from the repetition:

<a href="([^"]*)" class="nextpostslink">

This can never go past the closing ", so there is no need to worry that anything outside of the attribute or tag will be part of the match.
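
As a rough check of the negated character class, here is the same pattern run against a shortened, made-up version of the pagination HTML (the html string and example.com URLs are assumptions, not the original page):

```python
import re

# Shortened, hypothetical stand-in for the pagination HTML in the question.
html = (
    '<a href="http://example.com/" class="previouspostslink">prev</a>'
    "<a href='http://example.com/page/4/' class='page larger'>4</a>"
    '<a href="http://example.com/page/3/" class="nextpostslink">next</a>'
)

# [^"]* cannot cross a double quote, so the group stays inside one href value.
pattern = re.compile(r'<a href="([^"]*)" class="nextpostslink">')
next_urls = pattern.findall(html)
print(next_urls)
```

The attempt that starts at the first anchor now fails outright, since the text after its closing quote is not class="nextpostslink", and the engine moves on until it reaches the anchor that really has that class.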

Let me add, anyway, that you should not use regular expressions to parse HTML. What if ' is used instead of " (as in your second anchor tag in the given example)? What if there are multiple spaces between your attributes? What if there are more attributes than just href and class? What if the class attribute is listed before the href attribute?

jdotjdot's answer has a good example of how to do it the right way in Python.
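
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module instead. This is only an illustration: NextLinkFinder, find_next_url and the tricky sample string are hypothetical names and data, not part of the original question.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Hypothetical helper: keeps the href of the first <a> whose class
    list contains 'nextpostslink'."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_url is not None:
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or".
        classes = (attr_map.get("class") or "").split()
        if "nextpostslink" in classes:
            self.next_url = attr_map.get("href")


def find_next_url(html):
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


# Single quotes, extra whitespace, an extra attribute, and class before href.
tricky = "<a  class='nextpostslink'   title='Next'  href='http://example.com/page/3/'>next</a>"
print(find_next_url(tricky))
```

Because the parser hands over attributes as name/value pairs, quote style, spacing and attribute order stop mattering, which is exactly where the regex versions are fragile.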

Upvotes: 3
