Reputation: 1870
Let's say we want to extract the link from a tag like this:
input:
<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>
desired output:
http://www.google.com/home/etc
The first solution is to capture the link in a group, using this regex: href=[\'"]?([^\'" >]+)
But what I want to achieve is to match only the link that follows href, without matching href itself. Trying (?=href\")... (a lookahead assertion, which matches without consuming) still matches the href itself.
This is a regex-only question.
Upvotes: 1
Views: 64
Reputation: 43179
Get comfortable with a parser, e.g. BeautifulSoup.
With it, this can be achieved as follows:
from bs4 import BeautifulSoup
html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""
soup = BeautifulSoup(html, "html5lib")
print(soup.find('a')['href'])
# http://www.google.com/home/etc
BeautifulSoup supports a number of selectors, including CSS selectors.
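As a rough sketch of the CSS selector route: the snippet below uses select_one with the selector a[href]. The built-in "html.parser" backend is my assumption here, chosen only so that html5lib does not need to be installed; the names link and url are just illustrative.

```python
from bs4 import BeautifulSoup

html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
soup = BeautifulSoup(html, "html.parser")

# "a[href]" selects the first <a> element that carries an href attribute.
link = soup.select_one("a[href]")

# Attribute values are read with dictionary-style access on the tag.
url = link["href"] if link is not None else None
print(url)
```

select_one returns None when nothing matches, hence the guard before reading the attribute.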
Upvotes: 0
Reputation: 2280
A solution could be:
(?:href=)('|")(.*?)\1
(?:href=) is a non-capturing group. The engine uses href= during matching but does not return it as a group, although it is still part of the overall match. If you try this in a regex tester, you will see that no group holds it.
Every other pair of parentheses creates a capturing group. Consequently, ('|") defines group #1, and the URL you want will be in group #2. The lazy .*? stops at the first closing quote rather than the last one on the line. How you retrieve group #2 depends on the programming language.
Finally, \1 is a backreference: it matches the same text that group #1 captured (in this case "), so the URL must be closed by the same quote character that opened it.
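Since retrieval depends on the language, here is a minimal sketch of how it might look in Python with the re module. It assumes the lazy form (.*?) so the match ends at the first closing quote; the variable names pattern, match and url are only for illustration.

```python
import re

html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# Group 1 captures the opening quote, group 2 the URL;
# \1 requires the same quote character to close the value.
pattern = re.compile(r'''(?:href=)('|")(.*?)\1''')

match = pattern.search(html)
url = match.group(2) if match else None
print(url)
```

Note that match.group(0) would still contain href= and the quotes; only group 2 holds the bare URL.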
Upvotes: 1
Reputation: 73490
One of many regex based solutions would be a capturing group:
>>> import re
>>> s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'
[^"]* matches any number of characters other than ".
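If the whole match (rather than a group) should be just the URL, one alternative is a lookbehind instead of the lookahead tried in the question. This is a different technique from the capturing group above, sketched here under the assumption that the attribute is always written as href=" with a double quote; Python's re requires the lookbehind to have a fixed width, which this one has.

```python
import re

s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# (?<=href=") asserts that href=" sits immediately before the match
# without making it part of the match itself.
match = re.search(r'(?<=href=")[^"]*', s)
url = match.group(0) if match else None
print(url)
```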
Upvotes: 2