Reputation:
My input is similar to this:
<a href="link">text</a> <a href="correctLink">See full summary</a>
From this string I want to get only correctLink
(the link whose text is See full summary).
I'm working with Python, and I tried:
re.compile( '<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE )
but the only string I get from findall() is:
link">text</a> <a href="correctLink
Where is my mistake?
Upvotes: 0
Views: 385
Reputation: 1122552
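The non-greedy .*? still starts matching at the first <a href=" and keeps extending until the rest of the pattern matches, so it runs straight across the first link. Limit your link pattern to non-quote characters: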
re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
giving:
>>> import re
>>> patt = re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
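The two patterns can be checked side by side. The sketch below runs both on the sample input from the question; the variable names are illustrative, and re.DOTALL is left off the second pattern because it contains no `.` for the flag to affect.

```python
import re

html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'

# Pattern from the question: .*? is lazy, but the match still begins
# at the first <a href=" and grows until the trailing text is found.
lazy = re.compile(r'<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

# Restricted pattern: [^"]+ cannot cross the closing quote of the href value.
strict = re.compile(r'<a href="([^"]+)">See full summary</a>', re.IGNORECASE)

lazy_result = lazy.findall(html)
strict_result = strict.findall(html)

print(lazy_result)    # one over-long match spanning both links
print(strict_result)  # only the href of the matching link
```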
Better yet, use a proper HTML parser.
Using BeautifulSoup, finding that link would be as easy as:
soup.find('a', text='See full summary')['href']
for an exact text match:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>', 'html.parser')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
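If installing BeautifulSoup is not an option, roughly the same exact-text lookup can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the BeautifulSoup API: the class LinkByText and the helper find_links are hypothetical names made up for this illustration, and the sketch assumes Python 3 and <a> tags that are not nested.

```python
from html.parser import HTMLParser


class LinkByText(HTMLParser):
    """Collect href values of <a> tags whose text equals a target string."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            if ''.join(self._text).strip() == self.target:
                self.hrefs.append(self._href)
            self._href = None


def find_links(html, target):
    """Return the hrefs of all links whose text is exactly `target`."""
    parser = LinkByText(target)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
print(find_links(sample, 'See full summary'))
```

Unlike the regex, this does not depend on the exact attribute order or spacing inside the tag, which is the main reason to prefer a parser for real pages.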
Upvotes: 1