Reputation: 110969
I want to find the text between a pair of <a> tags that link to a given site.
Here's the regex string that I'm using to find the content:
r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>([^<]*))</a>''' % our_url
With our_url set to stackoverflow.com, the resulting pattern looks like this:
r'''(<a([^<>]*)href=("|')(http://)?(www\.)?stackoverflow.com([^'"]*)("|')([^<>]*)>([^<]*))</a>'''
This works for most links, but it fails to match a link that has other tags inside it. I tried changing the final part of the regex from:
([^<]*))</a>'''
to:
(.*))</a>'''
But that just grabbed everything on the page after the link, up to the last </a>, which I don't want. Any suggestions on how to solve this?
Upvotes: 3
Views: 846
Reputation: 21950
In the final group, instead of:
[^<]*
Try:
((?!</a).)*
In other words, match any character that isn't the start of a </a sequence.
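Here is a minimal sketch of that substitution applied to a simplified version of the pattern from the question. The function name `anchor_text`, the sample HTML, and the use of `re.escape`, non-capturing groups, and `re.DOTALL` are my own assumptions, not part of the original pattern.

```python
import re

def anchor_text(html, site):
    """Return the inner HTML of every <a> whose href points at `site` (hypothetical helper)."""
    pattern = re.compile(
        # Opening tag: optional scheme and "www.", then the escaped site name.
        r'''<a[^<>]*href=["'](?:https?://)?(?:www\.)?%s[^'"]*["'][^<>]*>'''
        # Body: consume one character at a time, as long as "</a" does not start here.
        r'''((?:(?!</a).)*)</a>''' % re.escape(site),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(html)

# Made-up sample input for illustration.
sample = (
    '<p><a href="http://stackoverflow.com/q/1">plain</a> and '
    '<a class="x" href="http://www.stackoverflow.com/q/2">with <b>bold</b> text</a> '
    '<a href="http://example.com/">other</a></p>'
)
print(anchor_text(sample, "stackoverflow.com"))
# prints ['plain', 'with <b>bold</b> text']
```

Because the lookahead is checked before every character, the body stops at the first </a rather than the last one, which is what the greedy (.*) attempt got wrong.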
Upvotes: 3
Reputation: 17124
>>> import re
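>>> # Group 1 is the href value, group 2 the anchor's inner HTML. The lazy quantifiers
>>> # stop at the first closing quote and the first </a>, so nested tags are kept.
>>> pattern = re.compile(r'<a[^>]+href=[\'"](.+?)[\'"].*?>(.+?)</a>', re.IGNORECASE)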
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there">Finding anchor text when there are tags there</a>'
>>> re.match(pattern, link).group(1)
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there'
>>> re.match(pattern, link).group(2)
'Finding anchor text when there are tags there'
Upvotes: 3
Reputation: 351516
I would not use a regex - use an HTML parser like Beautiful Soup.
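As a rough sketch of the parser-based approach: the code below uses the standard library's `html.parser` rather than Beautiful Soup, only so that it runs without extra dependencies. The class and function names are hypothetical, and it assumes the links are absolute URLs (with a scheme), which is stricter than the regex in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class AnchorTextCollector(HTMLParser):
    """Collect the visible text of <a> elements whose href points at `site`."""

    def __init__(self, site):
        super().__init__()
        self.site = site
        self.results = []    # finished anchor texts
        self._parts = None   # text chunks of the anchor being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host == self.site:
                self._parts = []   # start recording text for this anchor

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here too.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.results.append("".join(self._parts))
            self._parts = None

def anchor_texts(html, site):
    parser = AnchorTextCollector(site)
    parser.feed(html)
    parser.close()
    return parser.results

# Made-up sample input for illustration.
html = ('<a href="http://www.stackoverflow.com/q/2">with <b>bold</b> text</a>'
        '<a href="http://example.com/">other</a>')
print(anchor_texts(html, "stackoverflow.com"))
# prints ['with bold text']
```

Unlike the regex answers, this returns the text with the inner tags stripped, and it does not depend on attribute order or quoting style.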
Upvotes: 2