Reputation: 95
Hi, I have a regular expression:
<a href="(.+?)" class="nextpostslink">
This regex works fine on the following HTML:
'>
<span class='pages'>Page 1 of 12</span><span class='current'>1</span><a href='http://cinemassacre.com/category/avgn/page/2/' class='page larger'>2</a><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">»</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last »</a>
</div> </div>
The part I am trying to extract is the next page url from
<a href="http://cinemassacre.com/category/avgn/page/2/" class="nextpostslink">
But when I run this regex on this block of HTML
'>
<span class='pages'>Page 2 of 12</span><a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a><a href='http://cinemassacre.com/category/avgn/' class='page smaller'>1</a><span class='current'>2</span><a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a><a href='http://cinemassacre.com/category/avgn/page/4/' class='page larger'>4</a><a href='http://cinemassacre.com/category/avgn/page/5/' class='page larger'>5</a><a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a><span class='extend'>...</span><a href='http://cinemassacre.com/category/avgn/page/12/' class='last'>Last »</a>
</div>
</div>
It extracts everything from the first <a href="
to " class="nextpostslink">
Why does this happen? I thought (.+?) was non-greedy, so it should match the minimal amount,
which should be <a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">
The complete Python code I'm using is:
match=re.compile('<a href="(.+?)" class="nextpostslink">', re.DOTALL).findall(pagenav)
Upvotes: 2
Views: 727
Reputation: 17062
As I understand it, the greediness works from the beginning of the regex--i.e., it finds <a href="
, and then the non-greediness has it stop at the first " class="nextpostslink">
instead of the last one, like the greedy version would do.
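To make that concrete, here is a minimal sketch that runs the question's pattern against a shortened stand-in for the page-2 markup. The `html` string below is an abbreviated sample assumed for illustration, not the full page.

```python
import re

# Shortened stand-in for the page-2 markup from the question (assumed sample).
html = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a>'
)

lazy = re.compile(r'<a href="(.+?)" class="nextpostslink">', re.DOTALL)
match = lazy.search(html)

# The match begins at the leftmost <a href=" in the string, not at the tag we want.
print(match.start())

# The lazy group then grows until the first " class="nextpostslink"> that follows,
# so it swallows the previous-page link and everything in between.
print(match.group(1))
```

The lazy quantifier only decides where the group ends; it has no say over where the overall match starts.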
You're best off using BeautifulSoup here:
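from bs4 import BeautifulSoup
# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
# find the <a> tag whose class is "nextpostslink" and read its href attribute
print(soup.find("a", "nextpostslink")["href"])
# prints http://cinemassacre.com/category/avgn/page/2/ for the first HTML block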
Upvotes: 3
Reputation: 500475
"It extracts everything from the first <a href=" to " class="nextpostslink">. Why does this happen? I thought (.+?) was non greedy, so it should extract the minimal amount"
It is non-greedy. However, the mandatory " class="nextpostslink"> at the end of your regex forces the engine, once it has started at the first <a href=", to keep matching until it finds " class="nextpostslink">.
Upvotes: 1
Reputation: 44259
The start of your match is always greedy in a sense. That is because the engine attempts matches from left to right in your subject string. The first <a href="
is encountered, which is fine, and then the engine goes ahead and consumes everything with .+?
until the match is completed (it stops as soon as possible, due to the .+?
). But it does not try to start the match as far right as possible, because the match is just fine. Hence, you could say using ?
makes the end of the match ungreedy (taking the first possible end of the match), but the start of the match will always be greedy (the match will always begin at the leftmost possible position, no matter how you try to make it ungreedy).
This is why there is often a better alternative to ungreedy repetition: exclude the delimiter from the repetition:
<a href="([^"]*)" class="nextpostslink">
This can never go past the closing "
, so there is no need to worry that anything outside of the attribute or tag will be part of the match.
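As a rough check of the negated character class, the sketch below applies it to a shortened stand-in for the page-2 markup. The `html` string is an abbreviated sample assumed for illustration.

```python
import re

# Shortened stand-in for the page-2 markup from the question (assumed sample).
html = (
    '<a href="http://cinemassacre.com/category/avgn/" class="previouspostslink">«</a>'
    "<a href='http://cinemassacre.com/category/avgn/page/3/' class='page larger'>3</a>"
    '<a href="http://cinemassacre.com/category/avgn/page/3/" class="nextpostslink">»</a>'
)

# [^"]* cannot cross a double quote, so an attempt starting at the
# previous-page link fails and the engine moves on to the next <a href=".
pattern = re.compile(r'<a href="([^"]*)" class="nextpostslink">')
urls = pattern.findall(html)
print(urls)
```

Only the next-page URL is captured here, because the attempt that starts at the previous-page link fails at its closing quote instead of running on.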
Let me add anyway, that you should not use regular expressions to parse HTML. What if '
is used instead of "
(as in your second anchor tag in the given example)? What if there are multiple spaces between your attributes? What if there are more attributes than just href
and class
? What if the class
attribute is listed before the href
attribute?
jdotjdot's answer has a good example of how to do it the right way in Python.
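If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser. `NextLinkFinder` and `find_next_url` are hypothetical names made up for this example, and the sample string is invented to show single quotes, extra spaces, and reversed attribute order.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first <a> tag
    whose class list contains 'nextpostslink'."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        # Skip non-anchor tags and stop looking once a link was found.
        if tag != "a" or self.next_url is not None:
            return
        attr_map = dict(attrs)
        # class may hold several space-separated names, or be missing.
        classes = (attr_map.get("class") or "").split()
        if "nextpostslink" in classes:
            self.next_url = attr_map.get("href")


def find_next_url(html):
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


# Invented sample: single quotes, extra spaces, class listed before href.
sample = "<a class='nextpostslink'   href='http://cinemassacre.com/category/avgn/page/3/'>»</a>"
print(find_next_url(sample))
```

Because the parser hands over attributes as name/value pairs, the quote style, spacing, and attribute order no longer matter.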
Upvotes: 3