Reputation: 96827
I thought BeautifulSoup would be able to handle malformed documents, but when I sent it the source of a page, the following traceback was printed:
Traceback (most recent call last):
File "mx.py", line 7, in
s = BeautifulSoup(content)
File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
File "C:\Python26\lib\HTMLParser.py", line 108, in feed
self.goahead(0)
File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "C:\Python26\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34
Shouldn't it be able to handle this sort of stuff? If it can handle them, how could I do it? If not, is there a module that can handle malformed documents?
EDIT: Here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I try to create a soup object directly from the website, it works. Here's the document that causes trouble for soup.
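In code, the two cases look roughly like this (a sketch only; the URL and file name are placeholders, and content is read as in my original snippet):
import urllib2
from BeautifulSoup import BeautifulSoup

# Parsing the source fetched directly from the site works:
soup = BeautifulSoup(urllib2.urlopen("http://example.com/page").read())

# Parsing the copy saved locally with Firefox raises HTMLParseError:
soup = BeautifulSoup(open("page.html").read())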
Upvotes: 1
Views: 2886
Reputation: 82934
The problem appears to be the
contents = contents.replace(/</g, '&lt;');
in line 258 plus the similar
contents = contents.replace(/>/g, '&gt;');
in the next line.
I'd just use re.sub to clobber all occurrences of r"replace\(/[<>]/" with something innocuous before feeding it to BeautifulSoup; moving away from BeautifulSoup would be throwing out the baby with the bathwater, IMHO.
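A minimal sketch of that, assuming content holds the page source as in the question (the replacement text is arbitrary, just something HTMLParser won't mistake for an end tag):
import re
from BeautifulSoup import BeautifulSoup

# Neutralize the JavaScript replace(/</g, ...) and replace(/>/g, ...) calls,
# whose literal "</" sequence is what HTMLParser reports as a bad end tag.
cleaned = re.sub(r"replace\(/[<>]/g", "replace(/X/g", content)
soup = BeautifulSoup(cleaned)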
Upvotes: 1
Reputation: 211980
Worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a similar problem to yours some time ago, and reverting fixed it; I'd try that.
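If you want to try that, something along these lines should fetch the older release (assuming a setuptools-era easy_install; the version string is the one the home page mentions):
easy_install "BeautifulSoup==3.0.7a"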
If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
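A minimal sketch of that approach, assuming content holds the raw page source (the regex is blunt, but good enough for this purpose):
import re
from BeautifulSoup import BeautifulSoup

# Strip <script>...</script> blocks before parsing; the offending code lives
# there, and BeautifulSoup can't make sense of script contents anyway.
script_re = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
soup = BeautifulSoup(script_re.sub("", content))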
Upvotes: 5
Reputation: 7220
In my experience, BeautifulSoup isn't that fault-tolerant. I had to use it once for a small script and ran into these problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.
Upvotes: 1