dotancohen

Reputation: 31481

How to prevent webpage from crashing BeautifulSoup?

On Python 3.2.3 running on Kubuntu Linux 12.10, with Requests 0.12.1 and BeautifulSoup 4.1.0, some web pages break during parsing:

import bs4
import requests
from pprint import pprint

try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False  # this snippet runs inside a function (see the traceback below)

pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))

soup = bs4.BeautifulSoup(response.content)

Note that hundreds of other web pages parse fine. What is in this particular page that is crashing Python, and how can I work around it? Here is the crash:

bruno:scraper$ ./test-broken-site.py
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
  File "./test-broken-site.py", line 146, in <module>
    main(sys.argv)
  File "./test-broken-site.py", line 138, in main
    has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
  File "./test-broken-site.py", line 67, in test_page_parse
    soup = bs4.BeautifulSoup(response.content)
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
    self.parser.close()
  File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
  File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
  File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
  File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
  File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
    doctype = Doctype.for_name_and_ids(name, pubid, system)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
    return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found

Instead of bs4.BeautifulSoup(response.content) I also tried bs4.BeautifulSoup(response.text); that had the same result (the same crash on this page). What can I do to work around pages that break like this, so that I can parse them?

Upvotes: 1

Views: 799

Answers (1)

XapaJIaMnu

Reputation: 1500

The page in your output has an empty doctype:

<!DOCTYPE>

A well-formed page would have something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When the BeautifulSoup builder constructs the doctype here:

  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)

the doctype has no name, so the value passed to Doctype is None, and the constructor fails when it tries to coerce None to a string.
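You can reproduce the failure in isolation: Doctype subclasses str, and bs4 builds it by calling str.__new__(cls, value, encoding). A minimal sketch of what goes wrong when value is None:

# Doctype is a str subclass; bs4 effectively calls str(value, 'utf-8').
# With an empty <!DOCTYPE>, the doctype name is None, and str cannot
# coerce None when an encoding argument is given:
str(None, 'utf-8')
# TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found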

One solution is to fix the doctype manually with a regex before passing the page to BeautifulSoup, as in the sketch below.
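A minimal sketch of that approach; the helper name and the replacement doctype are illustrative assumptions, not part of the original answer. It rewrites an empty <!DOCTYPE> into <!DOCTYPE html> in the raw bytes before handing them to BeautifulSoup:

import re

import bs4
import requests

def soup_with_doctype_fix(url):
    # Fetch the raw bytes, then replace an empty doctype such as
    # <!DOCTYPE> with a plain HTML one, so that lxml's doctype
    # handler receives a name instead of None.
    response = requests.get(url)
    fixed = re.sub(br'<!DOCTYPE\s*>', b'<!DOCTYPE html>',
                   response.content, flags=re.IGNORECASE)
    return bs4.BeautifulSoup(fixed)

soup = soup_with_doctype_fix(
    'http://www.wbsonline.com/resources/employee-check-tampering-fraud/')

With the doctype repaired, bs4 no longer passes None to Doctype, and the page parses like the others.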

Upvotes: 1
