Reputation: 4690
I'm running the following script:
from bs4 import BeautifulSoup
import urllib2
import sys
print sys.version
url = 'https://www.google.com/finance'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
trends_tag = soup.find('div', {'id': 'topmovers'})
tags = trends_tag.find_all('td', 'change chg')
print len(tags)
tag = tags[0]
print 'Tag: ' + tag.text
On my computer, the output is:
2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
On the EC2 machine, the output is:
2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
11
Tag: 33.24%
12.18B
CLX
The Clorox Co
7.35%
11.67B
THOR
Thoratec Corporation
6.12%
1.47B
FOE
Ferro Corporation
6.03%
1.17B
NORD
Nord Anglia Education Inc
5.88%
1.70B
LosersChange
Mkt Cap
CRR
CARBO Ceramics Inc.
-16.10%
1.95B
CMCT
CIM Commercial Trust Corp
-10.54%
1.84B
HLF
Herbalife Ltd.
-10.31%
4.11B
INVN
InvenSense Inc
-10.10%
2.08B
TRS
TriMas Corp
-9.99%
1.34B
I've updated both machines to the same python version. The installed packages are a bit different though. My machine:
>pip freeze
PIL==1.1.7
beautifulsoup4==4.3.2
colorama==0.3.1
cssselect==0.9.1
frida==1.6.0
lxml==3.4.0
newspaper==0.0.7
numpy==1.8.1
pefile==1.2.10-139
pudb==2013.5.1
pygments==1.6
requests==2.4.1
scikit-learn==0.15-git
urwid==1.2.0
xlrd==0.9.2
xlwt==0.7.5
The EC2 machine:
>pip freeze
beautifulsoup4==4.3.2
It seems that the find_all returns a tag much bigger than what it should be. Also, when running print tags[0]
I get:
My machine:
<td class="change chg">33.24%
</td>
On the EC2 Machine:
<td class="change chg">33.24%
<td class="mktCap">12.18B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CLX&ei=lkwhVJDfJKjeiALmvYHACA" title="CLX">CLX</a>
<td class="name">
<a href="/finance?q=NYSE:CLX&ei=lkwhVJDfJKjeiALmvYHACA">The Clorox Co</a>
<td class="change chg">7.35%
<td class="mktCap">11.67B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:THOR&ei=lkwhVJDfJKjeiALmvYHACA" title="THOR">THOR
</a>
<td class="name">
<a href="/finance?q=NASDAQ:THOR&ei=lkwhVJDfJKjeiALmvYHACA">Thoratec Corporat
ion</a>
<td class="change chg">6.12%
<td class="mktCap">1.47B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:FOE&ei=lkwhVJDfJKjeiALmvYHACA" title="FOE">FOE</a>
<td class="name">
<a href="/finance?q=NYSE:FOE&ei=lkwhVJDfJKjeiALmvYHACA">Ferro Corporation</a
>
<td class="change chg">6.03%
<td class="mktCap">1.17B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:NORD&ei=lkwhVJDfJKjeiALmvYHACA" title="NORD">NORD</
a>
<td class="name">
<a href="/finance?q=NYSE:NORD&ei=lkwhVJDfJKjeiALmvYHACA">Nord Anglia Educati
on Inc</a>
<td class="change chg">5.88%
<td class="mktCap">1.70B
<tr><td style="height:.7em">
<tr class="colHeader">
<td class="title chr">Losers<td class="change">Change
<td class="mktCap">Mkt Cap
</td></td></td></tr>
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:CRR&ei=lkwhVJDfJKjeiALmvYHACA" title="CRR">CRR</a>
<td class="name">
<a href="/finance?q=NYSE:CRR&ei=lkwhVJDfJKjeiALmvYHACA">CARBO Ceramics Inc.<
/a>
<td class="change chr">-16.10%
<td class="mktCap">1.95B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:CMCT&ei=lkwhVJDfJKjeiALmvYHACA" title="CMCT">CMCT
</a>
<td class="name">
<a href="/finance?q=NASDAQ:CMCT&ei=lkwhVJDfJKjeiALmvYHACA">CIM Commercial Tr
ust Corp</a>
<td class="change chr">-10.54%
<td class="mktCap">1.84B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:HLF&ei=lkwhVJDfJKjeiALmvYHACA" title="HLF">HLF</a>
<td class="name">
<a href="/finance?q=NYSE:HLF&ei=lkwhVJDfJKjeiALmvYHACA">Herbalife Ltd.</a>
<td class="change chr">-10.31%
<td class="mktCap">4.11B
<tr>
<td class="symbol">
<a href="/finance?q=NYSE:INVN&ei=lkwhVJDfJKjeiALmvYHACA" title="INVN">INVN</
a>
<td class="name">
<a href="/finance?q=NYSE:INVN&ei=lkwhVJDfJKjeiALmvYHACA">InvenSense Inc</a>
<td class="change chr">-10.10%
<td class="mktCap">2.08B
<tr>
<td class="symbol">
<a href="/finance?q=NASDAQ:TRS&ei=lkwhVJDfJKjeiALmvYHACA" title="TRS">TRS</a
>
<td class="name">
<a href="/finance?q=NASDAQ:TRS&ei=lkwhVJDfJKjeiALmvYHACA">TriMas Corp</a>
<td class="change chr">-9.99%
<td class="mktCap">1.34B
<tr><td style="height:.7em">
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td></td></td></tr></td></td></td></td></tr></td></tr></td></td></td>
</td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td>
</tr></td></td>
Notice the </td></tr>
at the end - Like it merges the branches for some reason.
What can cause such a difference?
Sorry for the long question
Upvotes: 1
Views: 1218
Reputation: 1122382
The difference is lxml
. BeautifulSoup uses lxml
as the default parser when installed, with a fallback to the standard library HTMLParser
module when it is not.
Your input HTML is malformed, and parsers are allowed to 'make the best of it' when presented with such HTML. lxml
and HTMLParser
use different approaches to how to repair the HTML.
You can force BeautifulSoup to use a specific parser by naming it in a second argument when creating the BeautifulSoup()
instance, see Specifying a parser to use:
soup = BeautifulSoup(page, 'html.parser')
Upvotes: 5