Reputation: 31189
I'm validating custom HTML from users with html5lib. The problem is that html5lib adds html, head and body tags, which I don't need.
import html5lib
from html5lib import treebuilders

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
with open('/home/user/ex.html') as f:
    doc = parser.parse(f)
doc.toxml()
'<html><head/><body><div>\n <a href="http://speedhunters.com">speedhunters.com\n</a></div><a href="http://speedhunters.com">\n</a></body></html>'
The result validates and can be sanitized, but how can I remove these tags from the tree, or prevent them from being added in the first place? I mean without resorting to replace() on the output string.
Upvotes: 0
Views: 957
Reputation: 35766
It seems that we can use the hidden property of a Tag to prevent the tag itself from being 'exported' when casting a tag/soup to string/unicode:
>>> from bs4 import BeautifulSoup
>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup.body.prettify())
<body>
<div>
<footer>
<h3>
foot
</h3>
</footer>
</div>
<div>
foo
</div>
</body>
Essentially, the questioner's goal is to get the entire content of the body tag without the <body> wrapper itself. This works:
>>> soup.body.hidden = True
>>> print(soup.body.prettify())
<div>
<footer>
<h3>
foot
</h3>
</footer>
</div>
<div>
foo
</div>
I found this by going through the BeautifulSoup source. After calling soup = BeautifulSoup(html), the root tag has the internal name '[document]'. By default, only the root tag has hidden == True. This prevents its name from ending up in any HTML output.
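To make this reusable, the hidden trick can be wrapped in a small helper. This is only a sketch: body_inner_html is a made-up name, and it assumes Python 3 with both bs4 and html5lib installed. It uses str() instead of prettify(), so the markup comes back without added whitespace.

```python
from bs4 import BeautifulSoup


def body_inner_html(html):
    """Return the markup inside <body>, without the <body> wrapper.

    Hypothetical helper name; relies on the Tag.hidden attribute
    described above.
    """
    soup = BeautifulSoup(html, "html5lib")
    # A hidden tag emits its children but not its own start/end tags.
    soup.body.hidden = True
    return str(soup.body)


if __name__ == "__main__":
    print(body_inner_html("<div><footer><h3>foot</h3></footer></div><div>foo</div>"))
```

Tag.decode_contents() should give the same string without touching hidden, which may be preferable if the soup object is reused afterwards.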
Upvotes: 3
Reputation: 89057
Wow, html5lib has horrible documentation.
Looking through the source, and working on a quick test case, this appears to work:
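import html5lib
from html5lib import treebuilders

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))

def body_children_xml(path):
    # Parse the file, then serialize only the direct children of <body>.
    with open(path) as test:
        doc = parser.parse(test)
    # Iterating the document walks every node, so filter on the parent's name.
    return "".join(child.toxml() for child in doc
                   if child.parent.name == "body")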
It's a bit hackish, but less so than a replace().
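If the tree builder in use is not available, the same "keep only the children of body" idea can be applied to the serialized output instead. The sketch below is not html5lib API: it assumes the output of toxml() is well-formed XML (as in the question's example) and uses only the standard library; body_children_xml is a made-up name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def body_children_xml(xml_text):
    """Serialize only what sits inside <body> of a well-formed document.

    Assumption: xml_text looks like '<html><head/><body>...</body></html>'.
    """
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return ""
    # Text directly inside <body>, before its first child element.
    parts = [escape(body.text or "")]
    # tostring() also emits each child's tail text, so nothing is lost.
    parts.extend(ET.tostring(child, encoding="unicode") for child in body)
    return "".join(parts)


if __name__ == "__main__":
    serialized = "<html><head/><body><div>foo</div><p>bar</p></body></html>"
    print(body_children_xml(serialized))
```

Unlike a plain replace(), this does not depend on the exact spelling of the wrapper tags, only on the document structure.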
Upvotes: 1