Reputation: 1592

remove all data attributes with etree from all elements

So I'm attempting to clean some HTML. I've got the following function:

def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')

    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')

    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')

    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')

    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')

    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')

I'd also like to remove all data-* attributes, e.g.:

<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>

But the attributes are unknown to me (above is just an example).

So I'm looking to transform the above into just <li>sky</li>, and this would need to run on every element on the page.

In my code above I'm able to remove simple things like id, class but I'm not sure how to handle the dynamic attributes data-*. Possibly regex?

EDIT

I should clarify a bit about the input. My example above shows the use of <li> tags. But the actual input is the entire html of a page so it would be something like:

<html>
  <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
  <p data-para="cvcv">content</p>
 <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>

Upvotes: 3

Answers (3)

Jack Fleeting

Reputation: 24930

Maybe this is what you're looking for:

from lxml import etree

code = """
 <html>
   <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
    <p data-para="cvcv">content</p> 
</html>

"""

xml = etree.XML(code)
for element in xml.iter():
    # element.text is None for elements without text, so check it before calling strip()
    if element.text and element.text.strip():
        print('<' + element.tag + '>' + element.text + '</' + element.tag + '>')

Output:

<li>something</li>
<li>something</li>
<p>content</p>

Upvotes: 0

mzjn

Reputation: 50947

Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:

for el in tree.xpath("//*"):
    # Loop over a copy of the attribute names so that removing entries mid-loop is safe
    for attr in list(el.attrib):
        if attr.startswith("data-"):
            el.attrib.pop(attr)
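
For reference, here is a minimal sketch of how the three loops from the question and the data-* rule could be folded into a single pass. The helper name `strip_attributes` and its `names` / `prefixes` parameters are made up for illustration and are not part of lxml; the sample markup is also invented.

```python
from lxml import etree


def strip_attributes(tree, names=("style", "class", "id"), prefixes=("data-",)):
    """Remove attributes by exact name or by name prefix, in place.

    Hypothetical helper: `names` and `prefixes` are illustrative parameters.
    """
    for el in tree.xpath("//*"):
        # Copy the attribute names first so deleting while looping is safe.
        for attr in list(el.attrib):
            if attr in names or attr.startswith(prefixes):
                del el.attrib[attr]
    return tree


html = (
    '<ul><li data-direction="ltr" class="item" id="a">'
    '<a href="/sky" data-state="menu idle">sky</a></li></ul>'
)
tree = etree.HTML(html)
strip_attributes(tree)
result = etree.tostring(tree, encoding="unicode", method="html")
print(result)
```

Attributes that match neither list, such as `href`, are left alone, so links survive the cleanup.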

Upvotes: 3

kubarik

Reputation: 64

You can clear the attributes like this:

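import re

def strip_attribute(data):
    # Match a data-* attribute with a double-quoted value, plus the whitespace before it
    p = re.compile(r'\s+data-[^=]*="[^"]*"')
    return p.sub('', data)

# prints: <li>something</li>
print(strip_attribute('<li data-i="sdfdsf">something</li>'))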
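
Note that the larger sample in the question wraps its attribute values in single quotes (the `data-image-info` JSON blobs), which a pattern that only looks for double quotes would skip. Below is a rough sketch of a quote-aware variant. The names `DATA_ATTR` and `strip_data_attributes_regex` are invented for this example, and it assumes every data-* attribute has a quoted value. A regex cannot tell markup from text that merely looks like an attribute, so a parser-based approach is probably safer for arbitrary pages.

```python
import re

# Assumption: each data-* attribute has a value wrapped in either double or
# single quotes. Unquoted or valueless attributes are not handled here.
DATA_ATTR = re.compile(r'''\s+data-[\w-]+\s*=\s*(?:"[^"]*"|'[^']*')''')


def strip_data_attributes_regex(html):
    """Return html with quoted data-* attributes removed (illustrative helper)."""
    return DATA_ATTR.sub('', html)


sample = """<div data-image-info='{"displayMode":"fill"}' class="icon">x</div>"""
cleaned = strip_data_attributes_regex(sample)
print(cleaned)
```

Because the single-quoted alternative only stops at the next single quote, double quotes inside the JSON value do not end the match early.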

Upvotes: 0
