Reputation: 6825
I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form <a href="...">Content</a>. Now I'd like to strip out all links that have no href attribute, e.g.
<a rel="nofollow">Link to be removed</a>
should become
Link to be removed
The same for:
<a>Other link to be removed</a>
should become:
Other link to be removed
In short: every link that is missing an href attribute. It doesn't have to be a regex, but since lxml returns clean markup, one should be possible. What I need is the source string with these non-functional a tags stripped, keeping their text.
Upvotes: 0
Views: 1285
Reputation: 369274
Use the drop_tag method.
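import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# './/a' matches <a> elements at any depth, not only direct children of root
for a in root.xpath('.//a[not(@href)]'):
    a.drop_tag()
# encoding='unicode' makes tostring() return str rather than bytes on Python 3
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'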
.drop_tag(): Drops the tag, but keeps its children and text.
Upvotes: 1
Reputation: 60004
You can use BeautifulSoup, which makes it easier to find <a> tags without an href:
>>> from bs4 import BeautifulSoup as BS
>>> html = """
... <a rel="nofollow">Link to be removed</a>
... <a href="alink">This should not be included</a>
... <a>Other link to be removed</a>
... """
>>> soup = BS(html, 'lxml')
>>> for i in soup.find_all('a', href=False):
...     i.replace_with(i.text)
...
>>> print(soup)
<html><body>Link to be removed
<a href="alink">This should not be included</a>
Other link to be removed</body></html>
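One caveat with replace_with(i.text): it flattens the link to plain text, so any markup nested inside it (a <b>, for example) is lost. If that matters, a possible variation is Tag.unwrap(), which removes the tag but leaves its children in place. The sketch below makes two assumptions: the function name unwrap_hrefless_links is made up for illustration, and it uses the standard library's 'html.parser' so that no <html><body> wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup


def unwrap_hrefless_links(html):
    """Unwrap every <a> without an href (hypothetical helper, not part of bs4)."""
    # 'html.parser' leaves fragments as they are instead of wrapping them
    soup = BeautifulSoup(html, 'html.parser')
    # href=False matches <a> tags that have no href attribute at all
    for a in soup.find_all('a', href=False):
        a.unwrap()  # drop the tag, keep its text and child elements
    return str(soup)


result = unwrap_hrefless_links(
    '<a rel="nofollow">Link to be <b>removed</b></a> <a href="alink">kept</a>'
)
print(result)
```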
Upvotes: 2