I159
I159

Reputation: 31189

html5lib. How to get valid html without adding html, head and body tags?

I'm validating custom HTML from users with html5lib. The problem is the html5lib adds html, head and body tags, which I don't need.

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
f = open('/home/user/ex.html')
doc = parser.parse(f)
doc.toxml()
'<html><head/><body><div>\n  <a href="http://speedhunters.com">speedhunters.com\n</a></div><a href="http://speedhunters.com">\n</a></body></html>'

This is validated, can be sanitized, but how can I remove or prevent adding these tags to the tree? I mean exclude replace using.

Upvotes: 0

Views: 957

Answers (3)

Dr. Jan-Philip Gehrcke
Dr. Jan-Philip Gehrcke

Reputation: 35766

It seems that we can use the hidden property of Tags in order to prevent the tag itself from being 'exported' when casting a tag/soup to string/unicode:

>>> from bs4 import BeautifulSoup
>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print soup.body.prettify()
<body>
 <div>
  <footer>
   <h3>
    foot
   </h3>
  </footer>
 </div>
 <div>
  foo
 </div>
</body>

Essentially, the questioner's goal is to get the entire content of the body tag without the <body> wrapper itself. This works:

>>> soup.body.hidden=True
>>> print soup.body.prettify()
 <div>
  <footer>
   <h3>
    foot
   </h3>
  </footer>
 </div>
 <div>
  foo
 </div>

I found this by going through the BeautifulSoup source. After calling soup = BeautifulSoup(html), the root tag has the internal name '[document]'. By default, only the root tag has hidden==True. This prevents its name from ending up in any HTML output.

Upvotes: 3

Lucian
Lucian

Reputation: 574

lxml may be a better choice if you're dealing with "uncommon" html.

Upvotes: 1

Gareth Latty
Gareth Latty

Reputation: 89057

Wow, html5lib has horrible documentation.

Looking through the source, and working on a quick test case, this appears to work:

import html5lib
from html5lib import treebuilders
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
with open('test.html') as test:
    doc = parser.parse(test)
    for child in doc:
        if child.parent.name == "body":
            return child.toxml()

It's a bit hackish, but less so than a replace().

Upvotes: 1

Related Questions