kurupt_89

Reputation: 1592

Remove all data attributes from all elements with etree

So I'm attempting to clean some HTML. I've got the following function:

def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')

    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')

    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')

    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')

    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')

    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')

I also want to remove all data-* attributes, e.g.:

<li data-direction="ltr"
    data-listposition="center" data-data-id="dataItem-ifz7cqbs"
    data-state="menu idle link notMobile">sky</li>

The attribute names are not known in advance (the above is just an example).

I want to transform the above into just <li>sky</li>, and to run this on every element on the page.

My code above removes fixed attributes such as id and class, but I'm not sure how to handle the dynamic data-* attributes. Possibly with a regex?

EDIT

I should clarify the input a bit. My example above only shows an <li> tag, but the actual input is the entire HTML of a page, so it would be something like:

<html>
  <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
  <p data-para="cvcv">content</p>
 <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>

Upvotes: 3

Views: 949

Answers (3)

Jack Fleeting

Reputation: 24930

Maybe this is what you're looking for:

from lxml import etree

code = """
 <html>
   <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
    <p data-para="cvcv">content</p> 
</html>

"""

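xml = etree.XML(code)
# Print each element that has non-blank text, rebuilt without its attributes.
for element in xml.iter():
    # element.text is None for elements without text, so check before strip().
    if element.text and element.text.strip():
        print('<' + element.tag + '>' + element.text + '</' + element.tag + '>')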

Output:

<li>something</li>
<li>something</li>
<p>content</p>

Upvotes: 0

mzjn

Reputation: 50947

Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:

for el in tree.xpath("//*"):
    # Iterate over a copy of the names so removing entries is safe.
    for attr in list(el.attrib):
        if attr.startswith("data-"):
            el.attrib.pop(attr)
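
As a rough, self-contained sketch of the same idea, the loop can be wrapped in a helper and tried on the sample from the question. The name `strip_data_attributes`, the `prefix` parameter, and the extra `class="keep"` attribute are made up for illustration. The demo parses with the standard library's `xml.etree.ElementTree` only to stay dependency-free; the helper relies on nothing but `iter()` and `attrib`, which lxml elements also provide, so it should work on the tree returned by `etree.HTML(...)` as well.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this demo


def strip_data_attributes(root, prefix="data-"):
    """Remove, in place, every attribute whose name starts with `prefix`."""
    for el in root.iter("*"):
        # Snapshot the names first so the mapping is not changed mid-iteration.
        for name in list(el.attrib):
            if name.startswith(prefix):
                del el.attrib[name]
    return root


# Sample based on the question; class="keep" is an added, hypothetical attribute
# that shows non-data attributes are left alone.
sample = (
    '<html><ul>'
    '<li data-i="sdfdsf" class="keep">something</li>'
    '<li data-i="dsfd">something</li>'
    '</ul><p data-para="cvcv">content</p></html>'
)

root = strip_data_attributes(ET.fromstring(sample))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the check is done on parsed attribute names, it does not matter how the values were quoted in the source HTML.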

Upvotes: 3

kubarik

Reputation: 64

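You can clear the attributes with a regex like this:

import re

def strip_attribute(data):
    # Match optional leading whitespace, then data-<name>="<value>"
    # (double-quoted values only).
    p = re.compile(r'\s*data-[^=]*="[^"]*"')
    return p.sub('', data)

print(strip_attribute('<li data-i="sdfdsf">something</li>'))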
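
Note that the pattern above only covers double-quoted values, while the real input in the question also has single-quoted ones (`data-image-info='{...}'`). Below is a hedged sketch of a somewhat broader pattern that also accepts single-quoted, unquoted, and valueless attributes. `DATA_ATTR` and `strip_data_attrs` are hypothetical names, and the approach is still plain text substitution: it will also rewrite anything that merely looks like a data attribute inside text or script content, so a parser-based approach is usually the safer choice.

```python
import re

# Whitespace, a name starting with "data-", then optionally an
# ="...", ='...' or unquoted value. Illustrative pattern only.
DATA_ATTR = re.compile(
    r"""\s+data-[\w:.-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?""",
    re.IGNORECASE,
)


def strip_data_attrs(html):
    """Drop data-* attributes from a raw HTML string."""
    return DATA_ATTR.sub("", html)


double_quoted = strip_data_attrs(
    '<li data-direction="ltr" data-state="menu idle">sky</li>'
)
single_quoted = strip_data_attrs(
    "<div data-image-info='{\"a\": 1}' id=\"x\"> </div>"
)
valueless = strip_data_attrs('<p data-flag class="a">t</p>')

print(double_quoted)
print(single_quoted)
print(valueless)
```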

Upvotes: 0
