BeautifulSoup : Weird behavior with <p>

Question

I have the following HTML content:

content  = """


   div A
 
  P A

   div B
    
   P B1
   P B2

   div C
 
   P C1 
NODE
 


"""

It can be pictured like this (not sure if it helps, but I like diagrams):

If I use the following code:

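import bs4

# Parse the markup with the lxml backend
soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
allElem = firstDiv.find_all(recursive=False)
# Print every element that find_all returned, with its index
for i, el in enumerate(allElem):
    print("element ", i, " : ", el)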

I get this:

element  0  :   div A
 
element  1  :  P A
element  2  :   div B
 
element  3  :   P B1
element  4  :   P B2
element  5  :   div C
 
element  6  :   P C1 
element  7  :  NODE

As you can see, unlike elements 0, 2 and 5, element 6 doesn't contain its children. If I change its <p> to a different tag, it behaves as expected. Why does <p> behave differently? I still have this problem (if it is one) after upgrading from 4.3.2 to 4.4.6.

Blender · Accepted Answer

p elements can only contain phrasing content so what you have is actually invalid HTML. Here's an example of how it's parsed:

For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:

<p>Welcome. <form><label>Name:</label> <input></form>

It is parsed exactly like the following:

<p>Welcome. </p><form><label>Name:</label> <input></form>
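
The implied end tag described in that quote can be illustrated with a small standard-library sketch. This is not how lxml or html5lib are implemented; it is a toy re-serialiser built on Python's `html.parser.HTMLParser`. The names `CLOSES_P`, `VOID`, `ImpliedEndTagDemo` and `normalise` are hypothetical, and the tag sets are a deliberately partial subset chosen for the demo.

```python
from html.parser import HTMLParser

# Assumption: a partial, illustrative subset of the start tags that
# close an open <p>. The HTML standard lists many more.
CLOSES_P = {"address", "div", "form", "h1", "h2", "h3",
            "ol", "p", "pre", "table", "ul"}
# Assumption: a partial list of void elements (they never get an end tag).
VOID = {"br", "hr", "img", "input"}


class ImpliedEndTagDemo(HTMLParser):
    """Re-serialise markup, writing out the </p> that the parser implies.

    Attributes are dropped to keep the sketch short.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the re-serialised markup
        self.open_tags = []  # stack of currently open elements

    def _close_up_to(self, tag):
        # Pop and close elements until `tag` itself has been closed.
        while self.open_tags:
            closed = self.open_tags.pop()
            self.out.append("</%s>" % closed)
            if closed == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Simplification: any open <p> counts, regardless of scope.
        if tag in CLOSES_P and "p" in self.open_tags:
            self._close_up_to("p")
        self.out.append("<%s>" % tag)
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self._close_up_to(tag)
        # Otherwise the end tag is stray (for example the orphaned </p>)
        # and this sketch simply drops it.

    def handle_data(self, data):
        self.out.append(data)


def normalise(markup):
    parser = ImpliedEndTagDemo()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(normalise("<p>Welcome. <form><label>Name:</label> <input></form>"))
print(normalise("<div><p>P C1<div>NODE</div></p></div>"))
```

In the second call the inner `<div>` ends up as a sibling of the `<p>` rather than its child, which is the same shape as the behaviour described in the question. The markup in that call is made up for the demo, not the question's original HTML.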

You can confirm that this is how browsers parse your HTML (pictured is Chrome 64):

lxml is handling this correctly, as is html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
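
A quick way to see the difference between the three parsers is to feed each one the same small snippet and check whether the `<p>` still contains the nested block element. The sketch below assumes `bs4` is installed; the markup in `MARKUP` and the helper `p_keeps_div` are hypothetical, and `lxml` and `html5lib` are treated as optional, so a missing backend is reported as `None` rather than raising.

```python
import bs4

# Hypothetical minimal markup: a block-level <div> nested inside a <p>.
MARKUP = "<div><p>P C1<div>NODE</div></p></div>"


def p_keeps_div(parser_name):
    """Return True if the first <p> still contains the <div> after parsing,
    False if the parser moved it out, or None if the backend is missing."""
    try:
        soup = bs4.BeautifulSoup(MARKUP, parser_name)
    except bs4.FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return None
    return soup.p.find("div") is not None


for name in ("html.parser", "lxml", "html5lib"):
    print(name, "->", p_keeps_div(name))
```

With `html.parser` the `<div>` should stay inside the `<p>`, while `lxml` and `html5lib` should close the `<p>` before the `<div>` starts, so the check comes back `False` for them.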

I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.
