Jash Yin

Reputation: 1

A strange issue when trying to parse HTML with BeautifulSoup

I'm trying to write some Python code to gather music chart data from official websites, but I ran into trouble when gathering Billboard's data. I chose BeautifulSoup to handle the HTML.

My environment: Python 2.7, BeautifulSoup 3.2.0

First I parse the HTML:

>>> import BeautifulSoup, urllib2, re
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
>>> soup = BeautifulSoup.BeautifulSoup(html)

Then I try to extract the data I want, e.g., the artist name.

HTML:

<div class="listing chart_listing">

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider">
  <header>
    <span class="chart_position position-down">11</span>
            <h1>Ho Hey</h1>
        <p class="chart_info">
      <a href="/artist/418560/lumineers">The Lumineers</a>            <br>
      The Lumineers          </p>

The artist name is The Lumineers:

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\
... .find("p", {"class":"chart_info"}).a.string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'find'

NoneType! It seems it can't find the data I want. Maybe my pattern is wrong, so I tried finding some basic tags instead.

>>> print str(soup.find("div"))
None
>>> print str(soup.find("a"))
None
>>> print str(soup.find("title"))
<title>The Hot 100 : Page 2  | Billboard</title>
>>> print str(soup)
......entire HTML.....

I'm confused: why can't it find basic tags like div and a? They are definitely there. What's wrong with my code? The same code works fine when I parse other charts.

Upvotes: 0

Views: 225

Answers (1)

Anthon

Reputation: 76614

This seems to be a BeautifulSoup 3 issue. If you prettify() the parsed output:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup3(html)
print soup.prettify()

you can see at the end of the output:

        <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
  </script>
 </head>
</html>

With two </html> end tags, and a stray </script> and </head> after the first one, it looks like BeautifulSoup 3 is confused by the JavaScript in the <head> of this page.
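
To check that the div and a tags really are in the raw markup, independent of BeautifulSoup, you could count start tags with the standard library's parser. This is only a sketch: StartTagCounter and count_start_tags are names I made up, and the standard parser may itself choke on badly broken markup.

```python
from collections import Counter

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class StartTagCounter(HTMLParser):
    """Hypothetical helper: tallies every start tag the parser reports."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        self.counts[tag] += 1


def count_start_tags(markup):
    """Return a Counter mapping tag name -> number of start tags seen."""
    parser = StartTagCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts
```

If count_start_tags(html)['div'] is non-zero while soup.find("div") returns None, the tags are in the download and the problem is in how the tree was built.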

If you use:

from bs4 import BeautifulSoup as soup4
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup4(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

You get 'The Lumineers' as output.

If you cannot switch to bs4, I suggest you write the html variable out to a file out.txt and change the script to read from in.txt. Then copy out.txt to in.txt and cut away chunks of it until the lookup succeeds:

from BeautifulSoup import BeautifulSoup as soup3
import re

html = open('in.txt').read()
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

My first guess was to remove the <head> ... </head> section, and that worked wonders.

After that, you can do the same thing programmatically:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
head_end = htmlorg.index('>', head_end)
html = htmlorg[:head_start] + htmlorg[head_end+1:]
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
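
The same idea can be wrapped in a small helper that does not raise ValueError when a page has no head section. This is only a sketch: strip_head is a hypothetical name, and it keeps the same plain string matching as above (first '<head', last '</head>') rather than doing any real HTML parsing.

```python
def strip_head(markup):
    """Return markup with the span from the first '<head' to the last '</head>' removed.

    If either marker is missing, the input is returned unchanged.
    Caveat: '<head' also matches '<header', so this assumes the real
    <head> element comes before any <header> element.
    """
    closing = '</head>'
    start = markup.find('<head')
    end = markup.rfind(closing)
    if start == -1 or end == -1 or end < start:
        return markup
    return markup[:start] + markup[end + len(closing):]
```

The result can then be passed to soup3(...) exactly as in the script above.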

Upvotes: 1
