Reputation: 2875
I have the following HTML content:
content = """
<div>
<div> <div>div A</div> </div>
<p>P A</p>
<div> <div>div B</div> </div>
<p> P B1</p>
<p> P B2</p>
<div> <div>div C</div> </div>
<p> P C1 <div>NODE</div> </p>
</div>
"""
It can be pictured like this (not sure it helps, but I like diagrams):
If I use the following code:
import bs4

soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
# recursive=False returns only the direct children of the outer <div>
allElem = firstDiv.find_all(recursive=False)
for i, el in enumerate(allElem):
    print("element", i, ":", el)
I get this:
element 0 : <div> <div>div A</div> </div>
element 1 : <p>P A</p>
element 2 : <div> <div>div B</div> </div>
element 3 : <p> P B1</p>
element 4 : <p> P B2</p>
element 5 : <div> <div>div C</div> </div>
element 6 : <p> P C1 </p>
element 7 : <div>NODE</div>
As you can see, unlike elements 0, 2 and 5, element 6 doesn't contain its children. If I change its <p> to <b> or <div>, it behaves as expected. Why does <p> behave differently? I still have this problem (if it is one) after upgrading from 4.3.2 to 4.4.6.
Upvotes: 2
Views: 62
Reputation: 298166
<p> elements can only contain phrasing content, so what you have is actually invalid HTML. Here's an example of how it's parsed:
For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:
<p>Welcome. <form><label>Name:</label> <input></form>
It is parsed exactly like the following:
<p>Welcome. </p><form><label>Name:</label> <input></form>
You can confirm that this is how browsers parse your HTML (pictured is Chrome 64):
lxml is handling this correctly, as is html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
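To see the difference between parsers directly, you could run the question's markup through each one and list the direct children of the outer <div>. This is only a rough sketch: direct_children is a helper name made up for illustration, and any parser that isn't installed is skipped rather than treated as an error.

```python
import bs4

# Markup from the question: the last <p> illegally contains a <div>.
content = """
<div>
<div> <div>div A</div> </div>
<p>P A</p>
<div> <div>div B</div> </div>
<p> P B1</p>
<p> P B2</p>
<div> <div>div C</div> </div>
<p> P C1 <div>NODE</div> </p>
</div>
"""


def direct_children(parser):
    """Tag names of the outer <div>'s direct children (hypothetical helper).

    Returns None when the requested parser is not installed.
    """
    try:
        soup = bs4.BeautifulSoup(content, parser)
    except bs4.FeatureNotFound:
        return None
    return [el.name for el in soup.div.find_all(recursive=False)]


# Compare the three parsers Beautiful Soup commonly works with.
results = {p: direct_children(p) for p in ("html.parser", "lxml", "html5lib")}
for parser, names in results.items():
    print(parser, names)
```

With html.parser the list should end in a single p (the <div>NODE</div> stays nested inside it), whereas the question's output shows lxml ending the <p> early and promoting the <div> to a sibling.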
I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.
Upvotes: 4