DragonKnight

Reputation: 1870

How to look behind in regex without matching the pattern itself?

Let's say we want to extract the link in a tag like this:

input:

<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>

desired output:

http://www.google.com/home/etc

The first solution is to capture the link in a group using the regex href=[\'"]?([^\'" >]+), but what I want is a match that contains only the link that follows href=, not the href= itself. Trying (?=href\")... (a lookahead assertion, which matches without consuming) still matches the href itself.

This is a regex-only question.

Upvotes: 1

Views: 64

Answers (3)

Jan

Reputation: 43179

Get comfortable with a parser, e.g. BeautifulSoup.
With it, this can be achieved as follows:

from bs4 import BeautifulSoup

html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""

soup = BeautifulSoup(html, "html5lib")
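# Read the href attribute of the first <a> tag (.text would return "some text" instead)
print(soup.find('a')['href'])
# http://www.google.com/home/etc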

BeautifulSoup supports a number of selectors including CSS selectors.

Upvotes: 0

Neb

Reputation: 2280

A solution could be:

(?:href=)('|")(.*)\1

(?:href=) is a non-capturing group. The regex engine uses href= during matching but does not store it in a numbered group (it is still part of the overall match). Indeed, if you try this in a regex tester, you will see there is no group holding it.

Besides, every pair of plain parentheses creates a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. How you retrieve this depends on the programming language.

Finally, \1 is a backreference: it matches the value held by group #1 (in this case ") and so acts as the closing delimiter of the URL.
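
As one illustration of retrieving the groups, here is a minimal Python sketch using the standard re module. The variable names (html, pattern, whole_match, quote, url) are chosen here for illustration and are not part of the answer above.

```python
import re

html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# Same pattern as in the answer: non-capturing href=, group 1 = quote, group 2 = URL
pattern = re.compile(r'''(?:href=)('|")(.*)\1''')

m = pattern.search(html)
whole_match = m.group(0)  # overall match, still includes href= and the quotes
quote = m.group(1)        # the opening quote character
url = m.group(2)          # the URL between the quotes

print(whole_match)
print(url)
```

Note that (.*) is greedy: on a tag with several attributes quoted the same way, it can run on to the last matching quote on the line. A lazy (.*?) is a common way to stop at the first one.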

Upvotes: 1

user2390182

Reputation: 73490

One of many regex-based solutions would be a capturing group:

>>> import re
>>> s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'

[^"]* matches any number of characters other than ".
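
If the match itself should contain only the URL, as the question asks, a lookbehind assertion is one option. The following is a minimal sketch, assuming Python's re module and a double-quoted attribute; the name url_only is illustrative.

```python
import re

s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# (?<=href=") asserts that href=" sits immediately before the match
# without becoming part of it; [^"]* then consumes the URL itself.
m = re.search(r'(?<=href=")[^"]*', s)
url_only = m.group(0)

print(url_only)
```

Python's re module only accepts fixed-width lookbehind patterns, which href=" satisfies. Other regex flavours have different lookbehind rules.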

Upvotes: 2
