shibin

Reputation: 73

Regular expression to extract a specific value from HTML anchors

I am trying to extract the http://xyz.com/5 link from the string below. Only that link has the class="next" attribute, so I am trying to select it based on that attribute.

<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>

I tried the pattern below, but it returns all links in the entire text.

<a href='(.+?)' class="next">

(I understand from this site that using regular expressions to parse HTML is a bad idea, but I have to do this for now.)

Upvotes: 0

Views: 827

Answers (2)

TerryA

Reputation: 59974

Please don't use regex to parse HTML. Use something like BeautifulSoup. It's so much easier and better :p

from bs4 import BeautifulSoup as BS

html = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""
# Name the parser explicitly so bs4 does not have to guess one
soup = BS(html, 'html.parser')
# Only the link we want carries class="next"
for atag in soup.find_all('a', {'class': 'next'}):
    print(atag['href'])

With your example, this prints:

http://xyz.com/5

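Also, your regular expression works fine on this example, where each link is on its own line: in Python, . does not match a newline by default, so the match cannot start on an earlier link.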
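As a rough check of that claim, here is a minimal sketch using Python's re module. It assumes the links are separated by newlines exactly as in the question; the variable names per_line and across_lines are only illustrative. The second call adds re.DOTALL to show what happens once . is allowed to cross line breaks.

```python
import re

html = """<a href='http://xyz.com/1' class='page larger'>2</a>
<a href='http://xyz.com/2' class='page larger'>3</a>
<a href='http://xyz.com/3' class='page larger'>4</a>
<a href='http://xyz.com/4' class='page larger'>5</a>
<a href='http://xyz.com/5' class="next">»</a>"""

# The pattern from the question, unchanged
pattern = r"""<a href='(.+?)' class="next">"""

# Default flags: "." stops at a newline, so only the last line can match
per_line = re.findall(pattern, html)
print(per_line)

# With DOTALL, "." also matches newlines, so the match starts at the
# first link and runs on until class="next" is found
across_lines = re.findall(pattern, html, flags=re.DOTALL)
print(len(across_lines[0]))
```

If the real page has all the links on one line, or the pattern is run with a flag like DOTALL, the original pattern captures far more than one URL.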

Upvotes: 2

Barmar

Reputation: 780949

Try this regexp:

<a href='([^']+)' class="next">

Making a regular expression non-greedy doesn't mean it will always find the shortest match. It just means that once it has found a match, it returns it and won't keep looking for a longer one. Put another way, it uses the shortest match at the right-hand end of the wildcard, but not at the left-hand end.

So your regular expression was matching at the beginning of the first link and continuing until it found class="next". Using [^']+ instead of .+? means the wildcard cannot cross attribute boundaries, so you're assured of matching just one link.
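The difference can be sketched in Python as follows. This assumes the links sit on a single line with no newline between them (a hypothetical two-link excerpt of the sample); the names lazy and bounded are only for illustration.

```python
import re

# Hypothetical single-line excerpt of the sample from the question
html = (
    "<a href='http://xyz.com/4' class='page larger'>5</a>"
    "<a href='http://xyz.com/5' class=\"next\">»</a>"
)

lazy = re.compile(r"""<a href='(.+?)' class="next">""")
bounded = re.compile(r"""<a href='([^']+)' class="next">""")

# .+? starts at the first link and grows until class="next" follows,
# so the captured group spans both anchors
lazy_hits = lazy.findall(html)
print(lazy_hits)

# [^']+ cannot pass the closing quote of href, so the attempt at the
# first link fails and only the second link matches
bounded_hits = bounded.findall(html)
print(bounded_hits)
```

The negated character class works here only because the href value itself contains no single quote; a URL with an embedded quote would need a different approach.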

Upvotes: 2
