Match href value with a regular expression

Question

My input is similar to this:

<a href="link">text</a> <a href="correctLink">See full summary</a>

From this string I want to get only correctLink (the href of the link that has See full summary as its text).

I'm working with Python, and I tried:

re.compile('<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

but the only string I get with findall() is link">text</a> <a href="correctLink

Martijn Pieters · Accepted Answer

Limit your link pattern to non-quote characters:

re.compile('<a href="([^"]*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

giving:

>>> import re
>>> patt = re.compile('<a href="([^"]*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
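
To see why the character class matters, here is a minimal sketch that runs both patterns side by side. It assumes the input is two adjacent anchors, with the wanted one second; the variable names are only illustrative. A lazy `.*?` still begins at the first `<a href="` it finds and keeps growing until the rest of the pattern fits, so it can run across the first link. `[^"]*` cannot pass the closing quote of the href attribute, so the match has to start at the second anchor.

```python
import re

# Assumed input: two adjacent links, the wanted one second.
html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'

# Lazy .*? starts at the first <a href=" and expands until the tail matches,
# so the group swallows the first link and the start of the second.
lazy_patt = re.compile(r'<a href="(.*?)">See full summary</a>',
                       re.DOTALL | re.IGNORECASE)

# [^"]* stops at the quote that ends the href value.
bounded_patt = re.compile(r'<a href="([^"]*)">See full summary</a>',
                          re.IGNORECASE)

lazy_result = lazy_patt.findall(html)
bounded_result = bounded_patt.findall(html)

print(lazy_result)     # ['link">text</a> <a href="correctLink']
print(bounded_result)  # ['correctLink']
```

Both patterns still assume that href is the only attribute on the tag and that it uses double quotes, which is one reason a parser tends to be more robust.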

Better yet, use a proper HTML parser such as BeautifulSoup:

soup.find('a', text='See full summary')['href']

for an exact text match:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
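
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not part of the answer above; `LinkTextFinder` and `find_hrefs` are hypothetical helper names, and the sketch assumes anchors are not nested.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.wanted:
                self.matches.append(self._href)
            self._href = None


def find_hrefs(html, wanted):
    finder = LinkTextFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.matches


html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
hrefs = find_hrefs(html, "See full summary")
print(hrefs)  # ['correctLink']
```

Unlike the regular expression, this does not depend on attribute order or quote style, because the parser hands over the attributes already split into name and value pairs.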
