Reputation: 6087
YCombinator is nice enough to provide an RSS feed (and a bigger one) containing the top items on HackerNews. I am trying to write a Python script that fetches the RSS feed and then parses out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each of the items.
Here are a few sample lines of the RSS feed:
<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
<comments>http://news.ycombinator.com/item?id=4944322</comments>
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
<comments>http://news.ycombinator.com/item?id=4943361</comments>
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>
Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:
import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments
However, this script is giving output that looks like this:
EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch' - - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) - - http://news.ycombinator.com/item?id=4943361
...
As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. So why is that?
As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:
>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>
You'll notice that something screwy is happening with just the link tag. It just gets the close tag and then the text for that tag after it. This is some very strange behavior, especially in contrast to title and comments being parsed without a problem.
This seems to be a problem with BeautifulSoup, because the raw text actually read in by requests looks fine. I don't think the problem is limited to BeautifulSoup, though, because I tried the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on that API?).
Does anyone know why this would be happening or how I can still use BeautifulSoup without getting this error?
Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
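For reference, here is a minimal sketch of the xml.dom.minidom approach mentioned above, run against an inline snippet of the feed. The question doesn't show the actual minidom code used, so this is a hypothetical reconstruction:

```python
# Hypothetical reconstruction of the xml.dom.minidom approach -- the
# question doesn't show the actual code that was used.
import xml.dom.minidom

rss_text = """<rss version="2.0"><channel>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
<comments>http://news.ycombinator.com/item?id=4944322</comments>
</item>
</channel></rss>"""

dom = xml.dom.minidom.parseString(rss_text)
for item in dom.getElementsByTagName('item'):
    # firstChild of each element is a text node; .data holds its string value
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)
```

Because minidom is a real XML parser, <link> is treated as an ordinary element and its text survives.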
Update: I am on a Mac with OSX 10.8 using Python 2.7.2 and BS4 4.1.3.
Update 2: I have lxml and it was installed with pip. It is version 3.0.2. As far as libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.
Upvotes: 9
Views: 5171
Reputation: 1021
@Yan Hudon is right. I solved the problem with soup = BeautifulSoup(request.text, 'xml')
Upvotes: 3
Reputation: 41
Actually, the problem seems to be related to the parser you are using. By default, an HTML parser is used. Try soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.
It will then use an XML parser instead of an HTML one, and everything should be OK.
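A minimal sketch of that fix, run against an inline snippet of the feed rather than a live request (this assumes lxml is installed, which BeautifulSoup requires for its 'xml' feature):

```python
from bs4 import BeautifulSoup

rss_text = """<rss version="2.0"><channel>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
<comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel></rss>"""

# Passing 'xml' as the second argument selects lxml's XML parser, which
# treats <link> as an ordinary element rather than an HTML void tag.
soup = BeautifulSoup(rss_text, 'xml')
for item in soup.find_all('item'):
    link = item.find('link').text
    print(link)
```

With the XML parser, item.find('link').text returns the URL instead of an empty string.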
See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more info
Upvotes: 3
Reputation: 365707
I don't think there's a bug in BeautifulSoup here.
I installed a clean copy of BS4 4.1.3 on Apple's stock 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as </link>, and therefore it doesn't have the problem with item.find('link').
I also tried using the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and xml.etree.ElementTree in python.org 3.3.0, to parse the same thing, and it again worked fine. Here's the code:
import xml.etree.ElementTree as ET

# x holds the raw RSS text shown above
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
    for item in channel.findall('item'):
        title = item.find('title').text
        link = item.find('link').text
        comments = item.find('comments').text
        print(title)
        print(link)
        print(comments)
I then installed lxml 3.0.2 (I believe BS uses lxml if available), using Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), and did the same tests using its etree, and using BS, and again, everything worked.
In addition to screwing up the <link>, jdotjdot's output shows that BS4 is screwing up the <description> in an odd way. The original is this:
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
His output is:
<description>Comments]]></description>
My output from running his exact same code is:
<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
So, it seems like there's a much bigger problem going on here. The odd thing is that it's happening to two different people, when it isn't happening on a clean install of the latest version of anything.
That implies either that it's a bug that's been fixed and I just have a newer version of whatever had the bug, or it's something weird about the way they both installed something.
BS4 itself can be ruled out, since at least Treebranch has 4.1.3 just like me. Although, without knowing how he installed it, it could be a problem with the installation.
Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 as me.
It could very well be a bug with lxml or the underlying libxml, or the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that was fixed somewhere between 2.3.6 and 3.0.2. In fact, according to the lxml website and the change notes for every version after 2.3.5, there is no 2.3.6, so whatever he has may be some kind of buggy release from very early on a canceled branch or something. I don't know his libxml version, or how either was installed, or what platform he's on, so it's hard to guess, but at least this is something that can be investigated.
Upvotes: 1
Reputation: 17052
Wow, great question. This strikes me as a bug in BeautifulSoup. The reason that you can't access the link using soup.find_all('item').link is that when you first load the html into BeautifulSoup to begin with, it does something odd to the HTML:
>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch
<comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]></description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html
<comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]></description>
</item>
...
</channel>
</rss></body></html>
Look carefully: it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it would do this, but without fixing the problem in the BeautifulSoup.BeautifulSoup class initialization, you're not going to be able to use it for now.
I think your best (albeit hack-y) bet for now is to use the following for link:
>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'
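A self-contained sketch of that workaround, using the default HTML parser (which is exactly what produces the empty <link/> in the first place) on an inline snippet of the feed:

```python
from bs4 import BeautifulSoup

rss_text = """<rss version="2.0"><channel>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
<comments>http://news.ycombinator.com/item?id=4944322</comments>
</item>
</channel></rss>"""

# The HTML parser treats <link> as a void element, so the tag is emptied
# and the URL survives as a bare text node immediately after it -- hence
# next_sibling recovers it.
soup = BeautifulSoup(rss_text, 'html.parser')
for item in soup.find_all('item'):
    link = item.link.next_sibling.strip()
    print(link)
```

This works, but it depends on the mangled tree's shape, so switching to a real XML parser is the more robust fix.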
Upvotes: 7