Python regex to strip html a tags without href attribute

Question

I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form <a href="...">Content</a>. Now I'd like to strip out all links that have no href attribute, e.g.

<a>Link to be removed</a>

should become

Link to be removed

The same for:

<a>Other link to be removed</a>

Should become:

Other link to be removed

In short, every link with a missing href attribute should be unwrapped, keeping its text. It doesn't have to be a regex, but since lxml returns a clean markup structure, a regex should be feasible. What I need is a source string stripped of such non-functional a tags.

falsetru · Accepted Answer

Use the drop_tag method of lxml.html elements.

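import lxml.html

# Parse a fragment containing one link without href and one with href.
root = lxml.html.fromstring('<div>Test <a>Link to be removed</a>. <a href="#">link</a></div>')
# Select every <a> descendant that has no href attribute.
for a in root.xpath('.//a[not(@href)]'):
    a.drop_tag()  # remove the tag itself, keep its text and children

# encoding='unicode' makes tostring return str rather than bytes.
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be removed. <a href="#">link</a></div>'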

http://lxml.de/lxmlhtml.html

.drop_tag(): Drops the tag, but keeps its children and text.
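
Since the question asks about a regex, here is a minimal sketch of that route using only the standard library's re module. It assumes Cleaner-normalized markup: no nested a tags, no ">" inside attribute values, and no attribute value that itself contains the text " href=". The names unwrap_hrefless_links, A_TAG and HAS_HREF are made up for this illustration and are not part of lxml. The drop_tag approach is likely the more robust choice whenever the parsed tree is available.

```python
import re

# Matches one <a ...>...</a> element; group 1 = attributes, group 2 = inner content.
# Assumption: a tags are not nested and attribute values contain no '>'.
A_TAG = re.compile(r'<a(\s[^>]*)?>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# Detects an href attribute (with a value) inside the attribute string.
HAS_HREF = re.compile(r'\shref\s*=', re.IGNORECASE)


def unwrap_hrefless_links(html: str) -> str:
    """Replace every <a> element lacking href with its inner content."""
    def replace(match: re.Match) -> str:
        attrs = match.group(1) or ''
        if HAS_HREF.search(attrs):
            return match.group(0)  # functional link: keep unchanged
        return match.group(2)      # no href: drop the tag, keep the content
    return A_TAG.sub(replace, html)


source = '<div>Test <a>Link to be removed</a>. <a href="#">link</a></div>'
print(unwrap_hrefless_links(source))
```

With the sample string above, the first link is unwrapped and the second is left as it is. Markup that breaks the stated assumptions (for example, unbalanced tags) may be handled incorrectly, which is the usual argument for working on the parsed tree instead.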
