BeautifulSoup (bs4) parsing wrong

Question

Parsing this sample document with bs4, under Python 2.7.6:



<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>
<p>And can we <p>nest a paragraph, too?</p></p>
</body>
</html>

Using:

from bs4 import BeautifulSoup as BS
...
tree = BS(fh)

HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees </body>:


 
  
<html>
 <body>
  <p>
   HTML allows omitting P end-tags.
   <p>
    Like that and this.
    <p>
     And this, too.
     <p>
      What happened?
     </p>
     <p>
      And can we
      <p>
       nest a paragraph, too?
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>
</html>

It's not prettify()'s fault, because traversing the tree manually I get the same structure:

<[document]>
    <html>
        ␊
        <body>
            ␊
            <p>
                HTML allows omitting P end-tags.␊␊
                <p>
                    Like that and this.␊␊
                    <p>
                        And this, too.␊␊
                        <p>
                            What happened?
                        </p>
                        ␊
                        <p>
                            And can we 
                            <p>
                                nest a paragraph, too?
                            </p>
                        </p>
                        ␊
                    </p>
                </p>
            </p>
        </body>
        ␊
    </html>
    ␊

Now, this would be the right result for XML (at least up to </body>, at which point it should report a well-formedness error). But this ain't XML. What gives?

TextGeek · Accepted Answer

The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
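One plausible reading of why html.parser behaves this way: the standard-library parser that BS4 wraps only reports the tags that are literally present in the source, and does not synthesize the end-tags HTML lets you omit. The sketch below is my own illustration, not anything from the BS4 docs. It assumes Python 3, where the module is html.parser (on Python 2 it is the HTMLParser module), and the names TagLogger and tag_events are made up for the example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start/end tag event it is given."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of tag events the stdlib parser emits for markup."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


if __name__ == "__main__":
    # Two <p> start tags but only one </p> in the source.
    for event in tag_events("<p>one\n<p>two</p>"):
        print(event)
```

For that input the event list contains two "start" events and a single "end" event, so anything building a tree from these events alone will nest the second paragraph inside the first unless it applies the HTML omission rules itself.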

Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")
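To check which parsers on your own machine show the nesting problem, something like the following rough sketch may help. It measures how deeply P elements end up nested for each parser name. The helper names max_p_depth and compare_parsers and the SAMPLE string are made up for illustration; BeautifulSoup, find_all, find_parents and FeatureNotFound are the real bs4 names. It assumes bs4 is installed, and treats lxml and html5lib as optional.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: three paragraphs, all end-tags omitted.
SAMPLE = "<p>one\n<p>two\n<p>three"


def max_p_depth(markup, parser):
    """Return the deepest nesting level of <p> elements (1 means no nesting)."""
    soup = BeautifulSoup(markup, parser)
    depths = [len(p.find_parents("p")) + 1 for p in soup.find_all("p")]
    return max(depths) if depths else 0


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to its max <p> depth, or None if not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = max_p_depth(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, depth in compare_parsers(SAMPLE).items():
        print(name, depth)
```

A depth greater than 1 means the parser nested the paragraphs instead of closing them; a value of None just means that parser library is not installed, which may be worth ruling out before concluding that a parser "doesn't work".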
