jumbopap

Reputation: 4147

BeautifulSoup not properly parsing script text/template

I have a fairly complex template script that BeautifulSoup4 isn't parsing correctly. As you can see below, BS4 parses only part of the script into the tree before giving up. Why does this happen, and is there a way to fix it?

>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]

Edit: on further testing, for some reason it appears that BS3 is able to parse this correctly:

>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>

Upvotes: 0

Views: 1159

Answers (1)

Victor Sigler

Reputation: 23459

Beautiful Soup sometimes fails with its default parser. It supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers.

In cases like this I switch to a different parser, such as lxml or html5lib.

Here is an example of passing the parser name explicitly:

from bs4 import BeautifulSoup

# markup is the HTML string to parse; "lxml" requires the lxml package to be installed
soup = BeautifulSoup(markup, "lxml")

I recommend reading the documentation section on installing a parser: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
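To see which parser keeps the template intact, you could try each one and compare what ends up inside the script tag. The following is only a rough sketch: the helper name script_body_by_parser is made up for illustration, the markup is a shortened version of the one in the question, and the result depends on which parsers and which Python version you have installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of the template from the question (an assumption for brevity).
template_body = (
    '<section class="sectionname"><header><h1>Test</h1></header></section>'
)
html = (
    '<script id="scriptname" type="text/template">'
    + template_body
    + "</script> Other stuff I want to stay"
)


def script_body_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: map each parser name to the contents of the
    first <script> tag, or None when that parser is not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
            continue
        script = soup.find("script")
        # decode_contents() returns the markup inside the tag as a string.
        results[name] = script.decode_contents() if script is not None else ""
    return results


if __name__ == "__main__":
    for name, body in script_body_by_parser(html).items():
        if body is None:
            print(name, "-> not installed")
        else:
            print(name, "-> intact" if body == template_body else "-> altered", body)
```

A parser that reports "intact" kept the whole template as the script's content; one that reports "altered" changed or truncated it, which is the symptom shown in the question.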

Upvotes: 1
