Reputation: 73
I am trying to extract the http://xyz.com/5 link from the string below. Only that link has the class="next" attribute, so I am trying to match it based on this attribute.
<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>
I tried the pattern below, but it returns all the links in the entire text.
<a href='(.+?)' class="next">
(I understand from this site that using regular expressions to parse HTML is a bad idea, but I have to do this for now.)
Upvotes: 0
Views: 827
Reputation: 59974
Please don't use regex to parse HTML. Use something like BeautifulSoup. It's so much easier and better :p
from bs4 import BeautifulSoup as BS
html = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""
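# Name the parser explicitly so bs4 does not have to guess one
soup = BS(html, 'html.parser')
# class="next" appears only on the link we want
for atag in soup.find_all('a', {'class': 'next'}):
    print(atag['href'])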
With your example, this prints:
http://xyz.com/5
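Also, your regular expression works fine on this input: by default . does not match a newline, so with each link on its own line the match cannot span several links.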
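Whether the original pattern returns one link or everything probably depends on how . treats newlines. Here is a minimal sketch of that, assuming Python 3 and the standard re module; the variable names and the shortened sample are only for illustration.

```python
import re

# Shortened copy of the sample markup, one link per line.
html = """<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""

# The pattern from the question, unchanged.
pattern = r"""<a href='(.+?)' class="next">"""

# Default flags: . stops at a newline, so a match must fit on one line.
default_matches = re.findall(pattern, html)

# With re.DOTALL, . also matches newlines, so the match can start at the
# first link and run across lines until it reaches class="next".
dotall_matches = re.findall(pattern, html, re.DOTALL)

print(default_matches)
print(dotall_matches)
```

If the real page has all the links on a single line, or the pattern is compiled with re.DOTALL, the second behavior is the one to expect.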
Upvotes: 2
Reputation: 780949
Try this regexp:
<a href='([^']+)' class="next">
Making a regular expression non-greedy doesn't mean it will always find the shortest match. It just means that once it has found a match it will return it; it won't keep looking for a longer one. Put another way, it uses the shortest match at the right-hand end of the wildcard, but not at the left-hand end: the match still begins at the earliest position from which the pattern can succeed.
So your regular expression was matching at the beginning of the first link and continuing until it found class="next". Instead of using .+?, using [^']+ means that the wildcard will not cross attribute boundaries, so you're assured of matching just one link.
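To see the difference, here is a small sketch comparing the two patterns. It assumes Python 3 and that the links sit on a single line with no newlines between them; the shortened html string and the variable names are made up for the demonstration.

```python
import re

# Hypothetical single-line version of the markup (no newlines between links).
html = (
    "<a href='http://xyz.com/1' class='page larger'>2</a>"
    "<a href='http://xyz.com/4' class='page larger'>5</a>"
    "<a href='http://xyz.com/5' class=\"next\">»</a>"
)

# Non-greedy wildcard: the match starts at the first <a href=' and the
# group grows until ' class="next"> finally follows it.
lazy = re.findall(r"""<a href='(.+?)' class="next">""", html)

# Negated character class: the group cannot contain a quote, so it can
# never run past the end of one href value.
bounded = re.findall(r"""<a href='([^']+)' class="next">""", html)

print(lazy)
print(bounded)
```

The first pattern captures one long string that begins with the first URL and ends with the last, while the second captures only the URL of the link whose class is "next".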
Upvotes: 2