Reputation: 1023
I am using Beautiful Soup to parse a webpage. I've heard it's very famous and good, but it doesn't seem to work properly.
Here's what I did
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()
I think this is kind of straightforward. I open the webpage and pass it to Beautiful Soup. But here's what I got:
Warning (from warnings module):
File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
...
HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94
I thought the CNN website would be well designed, so I am not very sure what's going on. Does anyone have any idea about this?
Upvotes: 5
Views: 12877
Reputation: 1259
One of the simplest things you can do is specify the parser as "lxml". You do that by passing "lxml" to the BeautifulSoup() constructor as a second argument:
soup = BeautifulSoup(page, "lxml")
Then your code will be as follows.
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
So far I haven't had any trouble with this approach :)
Upvotes: 1
Reputation: 597
You need to use the html5lib parser with BeautifulSoup.
To install the required parser, use pip:
pip install html5lib
Then use that parser this way:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
html = br.open("http://google.com/", timeout=100).read()
soup = BeautifulSoup(html, 'html5lib')
a_s = soup.find_all('a')
for a in a_s:
    print a['href']
Upvotes: 2
Reputation: 414149
From the docs:
If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
Your code works as is (on Python 2.7, Python 3.3) if, on Python 2.7, you install a more robust parser (such as lxml or html5lib):
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen  # py3k
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url))
print(soup.prettify())
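For reference, the parser can also be named explicitly as the second argument to BeautifulSoup (a minimal sketch; "lxml" and "html5lib" assume those packages are installed, while "html.parser" is the stdlib fallback):

```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

html = "<html><body><p>Hello</p></body></html>"
# The second argument selects the tree builder explicitly,
# e.g. "lxml", "html5lib", or the stdlib "html.parser".
soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())
```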
The Python bug report HTMLParser.py - more robust SCRIPT tag parsing might be related.
Upvotes: 10
Reputation: 32066
You cannot count on BeautifulSoup, or any HTML parser, to read arbitrary web pages: you are never guaranteed that a web page is a well-formed document. Let me explain what is happening in this particular case.
On that page there is this inline JavaScript:
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";
You can see that it's creating a string that will put a script tag onto the page. Now, if you're an HTML parser, this is a very tricky thing to deal with: you go along reading your tokens when suddenly you hit a <script> tag. Now, unfortunately, if you did this:
<script>
alert('hello');
<script>
alert('goodby');
Most parsers would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would think both are valid scripts.
So, in this case, BeautifulSoup sees a <script> tag, and even though it's inside a JavaScript string, it looks like it could be a valid start tag, and BeautifulSoup has a seizure, as well it should.
If you look at the string again, you can see they do this interesting piece of work:
... "</" + "script>";
This seems odd, right? Wouldn't it be better to just do str = " ... </script>" without the extra string concatenation? This is actually a common trick (by silly people who write script tags as strings, a bad practice) to make the parser NOT break. Because if you do this:
var a = '</script>';
in an inline script, the parser will come along, see just </script>, and think the whole script tag has ended, dumping the rest of that script tag's contents onto the page as plain text. This is because you can technically put a closing script tag anywhere, even if your JS syntax is invalid. From the parser's point of view, it's better to get out of the script tag early than to try to render your HTML code as JavaScript.
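You can reproduce this early-exit behaviour with Python's built-in html.parser module, the same parser Beautiful Soup was falling back on (a sketch; the ScriptWatcher class is hypothetical, just for illustration):

```python
from html.parser import HTMLParser  # Python 3 stdlib

class ScriptWatcher(HTMLParser):
    """Records which text the parser attributes to <script> bodies."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_script = False
        self.script_data = []   # text seen inside <script>...</script>
        self.stray_text = []    # text seen outside any script element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        (self.script_data if self.in_script else self.stray_text).append(data)

# A literal </script> inside a JS string ends the script element early:
naive = ScriptWatcher()
naive.feed("<script>var a = '</script>';</script>")
naive.close()

# The "</" + "script>" trick keeps the parser inside the script element,
# because the text no longer contains a contiguous </script sequence:
tricky = ScriptWatcher()
tricky.feed('<script>var s = "</" + "script>";</script>')
tricky.close()
```

In the naive case only the text up to the embedded </script> counts as script content, and the trailing "';" leaks out as ordinary page text; the concatenated version keeps the whole statement inside the script element.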
So, you can't use a regular HTML parser to parse arbitrary web pages. It's a very, very dangerous game. There is no guarantee you'll get well-formed HTML. Depending on what you're trying to do, you could read the content of the page with a regex, or try getting a fully rendered page with a headless browser.
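For example, a quick-and-dirty regex extraction (a sketch using made-up HTML; fine for grabbing one known field, not for real parsing):

```python
import re

# Hypothetical page content standing in for a fetched document.
html = """<html><head><title>Skydiver record attempt</title></head>
<body><p>Story text here.</p></body></html>"""

# Non-greedy match between the title tags; re.S lets '.' cross newlines.
match = re.search(r"<title>(.*?)</title>", html, re.S)
if match:
    print(match.group(1))
```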
Upvotes: 8