Reputation: 9004
I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.
Since Beautiful Soup won't choke if you give it bad markup, I wonder why it gives me these hiccups when part of a document is malformed, and whether there is a way to make it resume at the next readable portion of the document regardless of the error.
The line where the error occurred is the 3rd one:
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
The full CLI output is:
Traceback (most recent call last):
File "./grablinks", line 101, in <module>
sys.exit(main())
File "./grablinks", line 88, in main
links = grab_links(options)
File "./grablinks", line 36, in grab_links
doc = doc_parser(reader)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
Upvotes: 5
Views: 948
Reputation: 536567
Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even ‘bad markup’, for XML...
It's a bug in sgmllib, which BeautifulSoup is using: it tries to find custom methods with the same names as tags, but in Python 2 method names are byte strings, so even looking up a method whose name contains a non-ASCII character (which will never be present) fails.
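You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except (AttributeError, UnicodeError): but that's not really a good fix. (The parentheses matter: in Python 2, except AttributeError, UnicodeError: would catch only AttributeError and bind it to the name UnicodeError.) It's not trivial to override the rest of the method either.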
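To illustrate the lookup that fails and the guarded version, here is a minimal sketch of sgmllib-style dispatch. The class TagDispatcher and its methods are hypothetical stand-ins, not sgmllib's actual code. On Python 2 the getattr call raises UnicodeEncodeError for a unicode tag name with non-ASCII characters; on Python 3 it raises a plain AttributeError. The tuple in the except clause covers both cases.

```python
class TagDispatcher(object):
    """Hypothetical stand-in for sgmllib's end-tag dispatch, not its real code."""

    def end_p(self):
        # Custom handler, found by name the way sgmllib looks up end_<tag>.
        return u"handled"

    def unknown_endtag(self, tag):
        # Fallback for tags without a dedicated handler.
        return u"unknown"

    def finish_endtag(self, tag):
        try:
            method = getattr(self, u"end_" + tag)
        except (AttributeError, UnicodeError):
            # Tuple form: catches the missing-attribute case and, on
            # Python 2, the failed ASCII encoding of the attribute name.
            return self.unknown_endtag(tag)
        return method()


dispatcher = TagDispatcher()
print(dispatcher.finish_endtag(u"p"))
print(dispatcher.finish_endtag(u"caf\xe9"))
```

With the guard in place, a non-ASCII tag name simply falls through to the unknown-tag handler instead of aborting the parse.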
What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness really: XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
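As a rough illustration of link extraction that does not go through sgmllib, here is a sketch using the standard library's html.parser on Python 3. This is a swapped-in technique, not html5lib (which is a separate third-party package), and the names LinkCollector and collect_links are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Sample markup with a non-ASCII element name wrapped around a link.
sample = '<caf\u00e9><a href="http://example.com/x">x</a></caf\u00e9><a href="/y">y</a>'
print(collect_links(sample))
```

On Python 3 the markup is handled as text (str) throughout, so a non-ASCII element name does not trigger the encoding error shown in the traceback.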
Upvotes: 2
Reputation: 6317
This happens in Python versions before 3.0 if the input contains non-ASCII characters.
If you call str(...) on a unicode string containing characters with a code point above 127 (that is, outside ASCII), this exception is raised.
Here, the error probably occurs because getattr tries to apply str to a unicode string; it "thinks" it can safely do this because in Python versions prior to 3.0 identifiers must be ASCII-only.
Check your HTML for non-ASCII characters. Try to replace or encode them, and if it still does not work, tell us.
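One way to do that check is to scan for tag names containing characters outside ASCII. The following is only a sketch: the regular expression is a rough approximation of tag syntax rather than a real parser, and non_ascii_tag_names is a name invented for this example.

```python
import re

# Rough approximation: "<" or "</", then a name running up to whitespace,
# "/", "<" or ">". Comments, doctypes and processing instructions are
# skipped by excluding "!" and "?" as the first character.
TAG_NAME = re.compile(u"</?\\s*([^\\s<>/!?][^\\s<>/]*)")


def non_ascii_tag_names(markup):
    """Return the sorted tag names that contain non-ASCII characters."""
    names = set()
    for match in TAG_NAME.finditer(markup):
        name = match.group(1)
        if any(ord(ch) > 127 for ch in name):
            names.add(name)
    return sorted(names)


print(non_ascii_tag_names(u"<p>ok</p><caf\xe9>x</caf\xe9>"))
```

The function expects already decoded text (a unicode string on Python 2), so decode the downloaded bytes first.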
Upvotes: 0