Adam

Reputation: 2361

beautifulsoup configure autoclosing tags

Let me explain my issue with an example:

from bs4 import BeautifulSoup

txt = """
<html>
<body>
<ul>
    <li> 1
    <li> 2
</ul>
</body>
"""

soup = BeautifulSoup(txt)

print(soup.prettify())

Here is the output of this script:

<html>
 <body>
  <ul>
   <li>
    1
    <li>
     2
    </li>
   </li>
  </ul>
 </body>
</html>

As you can see, the li tags in the input HTML were not closed. BeautifulSoup repaired them by nesting the second li inside the first. Is it possible to configure BeautifulSoup to produce this output instead?

<html>
 <body>
  <ul>
   <li>
    1
   </li>
   <li>
    2
   </li>
  </ul>
 </body>
</html>

Upvotes: 0

Views: 84

Answers (1)

Martijn Pieters

Reputation: 1124968

The 'fixing' is applied by the parser used to load the HTML into the BeautifulSoup object tree.
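To illustrate where the repair has to come from, here is a minimal sketch using only the standard library's html.parser module (the `EventLogger` class is a hypothetical helper, not part of BeautifulSoup). It records the tag events the tokenizer reports for the unclosed list; no closing event for li appears, so whatever builds the tree must decide on its own where each li ends.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the start/end tag events reported by html.parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<ul><li> 1<li> 2</ul>")
logger.close()

# Only tags literally present in the input are reported.
for event in logger.events:
    print(event)
```

Both li elements are still open when the closing ul tag arrives, which is consistent with the nested output shown in the question.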

You can swap in a different parser; each one repairs broken HTML in its own way. Only the html.parser option is available out of the box, because it ships with Python; the other parsers have to be installed as additional packages.

I'd use the html5lib parser here, since it interprets non-standard HTML the same way a browser would. Alternatively, you can try the lxml parser:

>>> print(BeautifulSoup(txt, 'html5lib').prettify())
<html>
 <head>
 </head>
 <body>
  <ul>
   <li>
    1
   </li>
   <li>
    2
   </li>
  </ul>
 </body>
</html>
>>> print(BeautifulSoup(txt, 'lxml').prettify())
<html>
 <body>
  <ul>
   <li>
    1
   </li>
   <li>
    2
   </li>
  </ul>
 </body>
</html>

As you can see, both of these produce the desired list structure; html5lib additionally inserts an empty head element, as a browser would.

Only html.parser, the built-in parser, exhibits this problem:

>>> print(BeautifulSoup(txt, 'html.parser').prettify())
<html>
 <body>
  <ul>
   <li>
    1
    <li>
     2
    </li>
   </li>
  </ul>
 </body>
</html>
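
The difference between the parsers can also be checked without reading the prettified output. The following is a rough sketch, not a definitive recipe: `top_level_items` and `MARKUP` are names made up for this example, and it assumes that a parser which is not installed raises bs4's FeatureNotFound, in which case it is skipped. It counts how many li tags are direct children of the ul.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same markup as in the question: the <li> tags are never closed.
MARKUP = """
<html>
<body>
<ul>
    <li> 1
    <li> 2
</ul>
</body>
"""


def top_level_items(markup, parser):
    """Count <li> tags that are direct children of the first <ul>.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    return len(soup.ul.find_all("li", recursive=False))


results = {
    parser: top_level_items(MARKUP, parser)
    for parser in ("html.parser", "lxml", "html5lib")
}
for parser, count in results.items():
    print(parser, count)
```

Going by the outputs above, html.parser should report 1 (the second li sits inside the first), while lxml and html5lib should report 2 when they are installed.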

Upvotes: 2
