Reputation: 1592
So I'm attempting to clean some HTML. I've got the following function:
def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')
    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')
    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')
    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')
    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')
    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')
I'd also like to remove all data-* attributes, e.g.
<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>
The attribute names are not known to me in advance (the above is just an example).
So I'm looking to transform the above into just <li>sky</li>, and this would run on every element on the page.
In my code above I'm able to remove simple things like id and class, but I'm not sure how to handle the dynamic data-* attributes. Possibly a regex?
I should clarify a bit about the input. My example above uses <li> tags, but the actual input is the entire HTML of a page, so it would be something like:
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
<div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>
Upvotes: 3
Views: 949
Reputation: 24930
Maybe this is what you're looking for:
from lxml import etree

code = """
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
</html>
"""
xml = etree.XML(code)
for element in xml.iter():
    # element.text is None for elements without text, so check it before strip()
    if element.text and element.text.strip():
        print('<' + element.tag + '>' + element.text + '</' + element.tag + '>')
Output:
<li>something</li>
<li>something</li>
<p>content</p>
Upvotes: 0
Reputation: 50947
Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:
for el in tree.xpath("//*"):
    for attr in el.attrib:
        if attr.startswith("data-"):
            el.attrib.pop(attr)
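For completeness, here is a minimal self-contained sketch of the same idea applied to markup like the one in the question. The function name `strip_data_attributes` and the `sample` string are made up for illustration; the sketch assumes lxml is installed and copies the attribute names with `list()` before deleting, as a defensive habit when removing items inside a loop.

```python
from lxml import etree


def strip_data_attributes(tree):
    """Remove every attribute whose name starts with 'data-' (hypothetical helper)."""
    # The XPath '//*' selects element nodes only, so comments are skipped.
    for el in tree.xpath("//*"):
        # Iterate over a copy of the names because attributes are deleted in the loop.
        for name in list(el.attrib):
            if name.startswith("data-"):
                del el.attrib[name]
    return tree


# Illustrative input: double-quoted and single-quoted data-* values, plus a
# non-data attribute that should survive.
sample = (
    '<ul><li data-direction="ltr" data-state="menu idle" class="item">sky</li>'
    "<li data-info='{\"id\": 1}'>sea</li></ul>"
)

tree = etree.HTML(sample)
strip_data_attributes(tree)
cleaned = etree.tostring(tree, encoding="unicode", method="html")
print(cleaned)
```

Because the parser has already separated attribute names from their values, this works regardless of how the values are quoted or what they contain (for example the JSON blobs in `data-image-info`). In the question's `clean_html`, the loop could run right after the `id` removal.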
Upvotes: 3
Reputation: 64
You can strip the attributes with a regular expression like this:
import re

def strip_attribute(data):
    # matches data-<name>="<value>" (double-quoted values only)
    p = re.compile(r'data-[^=]*="[^"]*"')
    return p.sub('', data)

print(strip_attribute('with attribute'))
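The pattern above only covers double-quoted values, while the page in the question also uses single-quoted ones (`data-image-info='{...}'`). A rough variant that accepts both quote styles and also consumes the leading whitespace might look like the sketch below. The names `DATA_ATTR`, `strip_data_attributes_regex` and `sample` are invented for this example, and regex-based HTML editing remains fragile: unquoted values are not handled, and matching text inside scripts or page content would be removed as well.

```python
import re

# Whitespace, then a data-* name, then '=' and a value in double or single quotes.
DATA_ATTR = re.compile(r"""\s+data-[\w-]+\s*=\s*(?:"[^"]*"|'[^']*')""")


def strip_data_attributes_regex(html):
    """Remove quoted data-* attributes from an HTML string (hypothetical helper)."""
    return DATA_ATTR.sub("", html)


# Illustrative input with both quoting styles.
sample = """<li data-direction="ltr" data-state="menu idle">sky</li><div data-image-info='{"id": "x"}' ></div>"""

result = strip_data_attributes_regex(sample)
print(result)  # <li>sky</li><div ></div>
```

If the HTML is being parsed with lxml anyway, removing the attributes on the parsed tree is usually the more robust choice.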
Upvotes: 0