Simon Steinberger

Reputation: 6825

Python regex to strip html a tags without href attribute

I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form Content. Now I'd like to strip out all links that have no href attribute, e.g.

<a rel="nofollow">Link to be removed</a>

should become

Link to be removed

The same for:

<a>Other link to be removed</a>

should become

Other link to be removed

In short, this applies to all links with a missing href attribute. It doesn't have to be a regex, but since lxml returns a clean markup structure, a regex should be possible. What I need is the source string stripped of such non-functional a tags.

Upvotes: 0

Views: 1285

Answers (2)

falsetru

Reputation: 369274

Use the drop_tag method.

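import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# './/a' also matches anchors nested deeper than direct children of the root.
for a in root.xpath('.//a[not(@href)]'):
    a.drop_tag()

# encoding='unicode' returns str rather than bytes, so the comparison also holds on Python 3.
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'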

http://lxml.de/lxmlhtml.html

.drop_tag(): Drops the tag, but keeps its children and text.

Upvotes: 1

TerryA

Reputation: 60004

You can use BeautifulSoup, which makes it easier to find <a> tags without an href:

>>> from bs4 import BeautifulSoup as BS
>>> html = """
... <a rel="nofollow">Link to be removed</a>
... <a href="alink">This should not be included</a>
... <a>Other link to be removed</a>
... """
>>> soup = BS(html, 'lxml')
>>> for i in soup.find_all('a', href=False):
...     i.replace_with(i.text)
... 
>>> print(soup)
<html><body>Link to be removed
<a href="alink">This should not be included</a>
Other link to be removed</body></html>
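
One caveat: replace_with(i.text) flattens any markup nested inside the link, so a <b> inside the anchor would be lost. A minimal sketch of an alternative that keeps nested tags, assuming Beautiful Soup 4's Tag.unwrap() and the built-in html.parser (the sample markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<p><a rel="nofollow">Link to be <b>removed</b></a> and <a href="#">kept</a></p>'
# html.parser ships with Python and does not add <html>/<body> wrappers.
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=False):
    a.unwrap()  # removes the <a> tag itself, leaving its children in place
result = str(soup)
```

Here the <b> element should survive inside the paragraph, and only the href-less <a> wrapper is dropped.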

Upvotes: 2
