user1915496

Reputation: 65

BeautifulSoup HTMLParseError. What's wrong with this?

This is my code:

from bs4 import BeautifulSoup as BS
import urllib2
url = "http://services.runescape.com/m=news/recruit-a-friend-for-free-membership-and-xp"
res = urllib2.urlopen(url)
soup = BS(res.read())
other_content = soup.find_all('div',{'class':'Content'})[0]
print other_content

Yet an error comes up:

/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
  "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
  File "web.py", line 5, in <module>
    soup = BS(res.read())
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
    raise e

I've let two other people use this code, and it works for them perfectly fine. Why is it not working for me? I have bs4 installed...

Upvotes: 4

Views: 2336

Answers (1)

RocketDonkey

Reputation: 37259

Per the error message, one thing you may need to do is install lxml, which gives BeautifulSoup a more capable parsing engine. See the "Installing a parser" section of the docs (linked in the warning) for a better overview. The likely reason it works for the two other people is that they have lxml (or another parser that handles this HTML properly) installed, so BeautifulSoup uses it instead of Python's built-in parser. (Side note: your example works for me on a system with lxml installed, but fails on one without it.)

Also, see this note in the docs:

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
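To see whether that note applies to a given interpreter, a small check along these lines may help. It is only a sketch: the helper name builtin_parser_is_weak is made up for illustration, and the version thresholds are taken directly from the quoted note.

```python
import sys


def builtin_parser_is_weak(version_info=sys.version_info):
    """Return True if the docs' note about older Pythons applies.

    Hypothetical helper: thresholds (2.7.3 and 3.2.2) come from the
    Beautiful Soup note quoted above.
    """
    version = tuple(version_info[:3])
    if version[0] == 2:
        return version < (2, 7, 3)
    return version < (3, 2, 2)


if __name__ == "__main__":
    print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
    print("Built-in parser flagged as weak:", builtin_parser_is_weak())
```

If this reports that the built-in parser is flagged, installing lxml or html5lib is the fix the docs suggest.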

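I would recommend installing lxml and seeing if the problem continues: sudo apt-get install python-lxml on Debian/Ubuntu, or pip install lxml elsewhere (the /Library/Python/2.7 paths in your traceback suggest macOS, where apt-get is not available).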
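Once a parser is installed, it may also help to name it explicitly rather than relying on whichever one BeautifulSoup picks by default. The following is a minimal sketch, not the only way to do it: pick_parser and make_soup are hypothetical helper names, and the preference order (lxml, then html5lib, then the built-in parser) is my assumption.

```python
def pick_parser():
    """Return the name of the first parser that can be imported here."""
    for module_name in ("lxml", "html5lib"):
        try:
            __import__(module_name)
            return module_name
        except ImportError:
            continue
    # Nothing external is installed: fall back to Python's built-in parser.
    return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly chosen parser."""
    # Imported here so pick_parser() can be used even without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())


if __name__ == "__main__":
    print("Parser that would be used:", pick_parser())
```

In the question's script this would replace the BS(res.read()) call with make_soup(res.read()). Printing pick_parser() on each machine is also a quick way to confirm whether the machines where the code works have a different parser available.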

Upvotes: 6
