user1266094

Match href value with a regular expression

My input is similar to this:

<a href="link">text</a> <a href="correctLink">See full summary</a>

From this string I want to get only correctLink (the link whose text is See full summary).

I'm working with Python, and I tried:

re.compile( '<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE )

but the only string I get back from findall() is link">text</a> <a href="correctLink.

Where is my mistake?

Upvotes: 0

Views: 385

Answers (1)

Martijn Pieters

Reputation: 1122552

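Your .*? is non-greedy, but the regex engine still starts at the first <a href=" it finds and then expands the group until the rest of the pattern matches. Nothing stops the group from running past the closing quote of the first tag, so the match spans both links. Limit your link pattern to non-quote characters: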

re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

giving:

>>> import re
>>> patt = re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
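
To make the difference concrete, here is a small side-by-side sketch of the two patterns on the sample input from the question. The variable names (lazy, bounded and so on) are just illustrative:

```python
import re

html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'

# Lazy .*? still begins at the first '<a href="' and keeps growing
# until the tail of the pattern matches, so it can cross tag boundaries.
lazy = re.compile(r'<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

# [^"]+ cannot consume a double quote, so the group has to end at the
# closing quote of the same href attribute.
bounded = re.compile(r'<a href="([^"]+)">See full summary</a>', re.IGNORECASE)

lazy_result = lazy.findall(html)
bounded_result = bounded.findall(html)

print(lazy_result)     # the over-long match reported in the question
print(bounded_result)  # only the wanted link
```

The first attempt at the first tag fails with the bounded pattern (the text after the quote is not See full summary), so the engine moves on and matches the second tag instead.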

Better yet, use a proper HTML parser.

Using BeautifulSoup, finding that link would be as easy as:

soup.find('a', text='See full summary')['href']

for an exact text match:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>', 'html.parser')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
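
If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is not the BeautifulSoup approach above, only a minimal alternative that assumes Python 3; the class LinkByTextParser and the function find_hrefs_by_text are hypothetical names made up for this example:

```python
from html.parser import HTMLParser


class LinkByTextParser(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> tag that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.wanted:
                self.matches.append(self._href)
            self._href = None


def find_hrefs_by_text(html, wanted):
    """Return the href of every link whose text is exactly `wanted`."""
    parser = LinkByTextParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.matches


html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
result = find_hrefs_by_text(html, "See full summary")
print(result)
```

It is more code than the BeautifulSoup one-liner, and it does not handle nested links, but it avoids the regex pitfalls without a third-party dependency.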

Upvotes: 1
