Geo

Reputation: 96827

Why is BeautifulSoup throwing this HTMLParseError?

I thought BeautifulSoup would be able to handle malformed documents, but when I passed it the source of a page, it printed the following traceback:


Traceback (most recent call last):
  File "mx.py", line 7, in <module>
    s = BeautifulSoup(content)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34

Shouldn't it be able to handle this sort of stuff? If it can handle them, how could I do it? If not, is there a module that can handle malformed documents?

EDIT: here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file; that's where BeautifulSoup fails. If I create a soup object directly from the website, it works. Here's the document that causes trouble for soup.

Upvotes: 1

Views: 2886

Answers (3)

John Machin

Reputation: 82934

The problem appears to be the
contents = contents.replace(/</g, '&lt;');
in line 258 plus the similar
contents = contents.replace(/>/g, '&gt;');
in the next line.

I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be throwing the baby out with the bathwater, IMHO.
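A minimal sketch of that preprocessing step, assuming the page source is already in a string (the helper name and the replacement text are my own, not from the answer):

```python
import re

def defuse_script_replaces(html):
    # Rewrite the JavaScript calls like replace(/</g, ...) and
    # replace(/>/g, ...) so the raw "<" / ">" inside the regex literal
    # no longer looks like a broken tag to HTMLParser.
    return re.sub(r"replace\(/[<>]/", "replace(/_/", html)

page = 'contents = contents.replace(/</g, "&lt;");'
cleaned = defuse_script_replaces(page)
# cleaned can now be handed to BeautifulSoup without tripping the parser
```

The substitution changes the page's JavaScript, so this is only safe when you are scraping data out of the markup rather than executing or preserving the scripts.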

Upvotes: 1

Kenan Banks

Reputation: 211980

Worked fine for me with BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a similar problem some time ago, and reverting fixed it; I'd try that.

If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you can't parse that section with BeautifulSoup anyway.
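One way to drop the <script> blocks before parsing is a simple regex pass (a sketch, assuming the whole page is in one string; a regex can be fooled by pathological markup, but it matches the spirit of this answer):

```python
import re

# Match a <script ...> opening tag through the first closing </script>,
# across newlines, case-insensitively.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    # Delete every script block; BeautifulSoup doesn't need them for
    # extracting data, and here they are exactly what breaks the parse.
    return SCRIPT_RE.sub("", html)

page = '<html><script type="text/javascript">var a = 1;</script><p>hi</p></html>'
cleaned = strip_scripts(page)
# cleaned is '<html><p>hi</p></html>'
```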

Upvotes: 5

Phil Kulak

Reputation: 7220

In my experience, BeautifulSoup isn't that fault-tolerant. I had to use it once for a small script and ran into these problems. Using a regular expression to strip out the tags helped a bit, but I eventually gave up and moved the script over to Ruby and Nokogiri.

Upvotes: 1
