amit.shipra

Reputation: 197

BeautifulSoup (bs4): How to ignore ending tag in malformed HTML

I am using BeautifulSoup (bs4) to scrape an HTML page. It has <ul> lists whose <li> items hold some interesting links (href).

Snippet:

<ul>
 <!-- C 1-3 --></p>
 <li>
   <a href="http://LINK1" target="_blank">Link1 description</a>
 </li>
</ul>

<ul>
 <!-- E 1-2-3-6 --></p>
 <li>
  <a href="LINK-2" target="_blank">Link-2 description</a>
 </li>
 <p><!-- E 4-5 -7-8-9-10-11 --></p>
</ul>

Problem: When I use find_all() to extract all the <ul> elements, I do not get the expected result, because of the malformed closing </p> that has no matching opening <p>. The browser ignores this and renders the page fine, but BS4 messes up the parsing. Is there a way to make BS4 ignore malformed tags like this when they are present?

entries = soup.find_all(lambda x: x.name == 'ul')
print(len(entries))
print(entries[0])

1
<ul>
 <!-- C 1-3 --></ul>

Upvotes: 3

Views: 1586

Answers (1)

user212514

Reputation: 3130

I think you should try a more lenient parser for the HTML. For example:

soup = BeautifulSoup(pg, "html5lib")

html5lib is the most lenient of the parsers Beautiful Soup supports. Its advantages are:

  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5

Disadvantages are:

  • Very slow
  • External Python dependency

The documentation offers some explanation of the pros and cons of different parsers: https://beautiful-soup-4.readthedocs.org/en/latest/#installing-a-parser
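As a rough sketch of the suggestion above, here is the snippet from the question parsed with html5lib. It assumes the third-party packages beautifulsoup4 and html5lib are installed; the names `entries` and `links` are only illustrative. With this parser, both <ul> elements and their links should be found, since a stray </p> is handled the way a browser would handle it.

```python
# Sketch only: requires the third-party packages beautifulsoup4 and html5lib
# (for example: pip install beautifulsoup4 html5lib).
from bs4 import BeautifulSoup

# The malformed snippet from the question: each <ul> contains a stray </p>.
pg = """
<ul>
 <!-- C 1-3 --></p>
 <li>
   <a href="http://LINK1" target="_blank">Link1 description</a>
 </li>
</ul>

<ul>
 <!-- E 1-2-3-6 --></p>
 <li>
  <a href="LINK-2" target="_blank">Link-2 description</a>
 </li>
 <p><!-- E 4-5 -7-8-9-10-11 --></p>
</ul>
"""

# Parse with the lenient, browser-like html5lib tree builder.
soup = BeautifulSoup(pg, "html5lib")

# A plain tag name is enough here; no lambda is needed.
entries = soup.find_all("ul")

# Collect (href, link text) pairs from every link inside a list item.
links = [
    (a["href"], a.get_text(strip=True))
    for ul in entries
    for li in ul.find_all("li")
    for a in li.find_all("a", href=True)
]

print(len(entries))
for href, text in links:
    print(href, text)
```

If html5lib turns out to be too slow for large pages, the parser comparison in the documentation linked above is the place to weigh the alternatives.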

Upvotes: 8
