snoob dogg

Reputation: 2875

BeautifulSoup : Weird behavior with <p>

I've the following HTML content :

content  = """
<div>

  <div> <div>div A</div> </div>
  <p>P A</p>

  <div> <div>div B</div> </div>   
  <p> P B1</p>
  <p> P B2</p>

  <div> <div>div C</div> </div>
  <p> P C1 <div>NODE</div> </p>

</div>
"""

Which can be pictured like this (not sure if it helps, but I like diagrams): [image]

If I use the following code :

import bs4

soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
# recursive=False returns only the direct children of the outer <div>
allElem = firstDiv.find_all(recursive=False)
for i, el in enumerate(allElem):
    print("element ", i, " : ", el)

I get this :

element  0  :  <div> <div>div A</div> </div>
element  1  :  <p>P A</p>
element  2  :  <div> <div>div B</div> </div>
element  3  :  <p> P B1</p>
element  4  :  <p> P B2</p>
element  5  :  <div> <div>div C</div> </div>
element  6  :  <p> P C1 </p>
element  7  :  <div>NODE</div>

As you can see, unlike elements 0, 2 or 5, element 6 doesn't contain its children. If I change its <p> to <b> or <div>, then it behaves as expected. Why this little difference with <p>? I still have this problem (if it is one) after upgrading from 4.3.2 to 4.4.6.

Upvotes: 2

Views: 62

Answers (1)

Blender

Reputation: 298166

p elements can only contain phrasing content, and a div is not phrasing content, so what you have is actually invalid HTML. The HTML spec gives an example of how such markup is parsed:

For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:

<p>Welcome. <form><label>Name:</label> <input></form>

It is parsed exactly like the following:

<p>Welcome. </p><form><label>Name:</label> <input></form>

You can confirm that this is how browsers parse your HTML (pictured is Chrome 64):

[image: Chrome parsing invalid HTML]

lxml is handling this correctly, as is html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
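
If you want to see the difference for yourself, here is a minimal sketch that feeds the problematic paragraph from the question to each parser and checks whether the <div> is still inside the <p>. The names `fragment`, `p_contains_div` and `results` are made up for this illustration, and it assumes bs4 is installed; lxml and html5lib are optional installs, so the loop simply notes when one is missing.

```python
import bs4

# The problematic markup from the question: a <div> nested inside a <p>.
fragment = "<p> P C1 <div>NODE</div> </p>"


def p_contains_div(parser_name):
    """Return True if the first <p> still has the <div> as a descendant."""
    soup = bs4.BeautifulSoup(fragment, parser_name)
    return soup.p.find("div") is not None


results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser_name] = p_contains_div(parser_name)
    except bs4.FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        results[parser_name] = "not installed"

for parser_name, outcome in results.items():
    print(parser_name, "->", outcome)
```

With html.parser the <div> should stay nested inside the <p>, while lxml and html5lib should close the <p> before the <div>, which matches the output shown in the question.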

I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.

Upvotes: 4
