Reputation: 1
I'm trying to write some Python code to gather music chart data from official websites, but I ran into trouble gathering Billboard's data. I chose BeautifulSoup to parse the HTML.
My environment: Python 2.7, BeautifulSoup 3.2.0
First I parse the HTML:
>>> import BeautifulSoup, urllib2, re
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
>>> soup = BeautifulSoup.BeautifulSoup(html)
Then I try to extract the data I want, e.g., the artist name.
HTML:
<div class="listing chart_listing">
<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider">
<header>
<span class="chart_position position-down">11</span>
<h1>Ho Hey</h1>
<p class="chart_info">
<a href="/artist/418560/lumineers">The Lumineers</a> <br>
The Lumineers </p>
The artist name here is The Lumineers.
>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\
... .find("p", {"class":"chart_info"}).a.string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'find'
NoneType! It seems it can't find the data I want. Maybe my rule is wrong, so I tried to find some basic tags instead.
>>> print str(soup.find("div"))
None
>>> print str(soup.find("a"))
None
>>> print str(soup.find("title"))
<title>The Hot 100 : Page 2 | Billboard</title>
>>> print str(soup)
......entire HTML.....
I'm confused: why can't it find basic tags like div and a? They are definitely there. What's wrong with my code? Nothing goes wrong when I parse other charts with the same code.
Upvotes: 0
Views: 225
Reputation: 76614
This seems to be a BeautifulSoup 3 issue. If you call prettify() on the soup:
from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re
html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup3(html)
print soup.prettify()
you can see at the end of the output:
<script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
</script>
</head>
</html>
With two </html> end tags, plus a stray </script> and </head> between them, it looks like BeautifulSoup 3 is confused by the JavaScript in this page.
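A quick way to spot this kind of damage without reading the whole dump is to count closing tags in the prettified text. This is only a minimal sketch: count_closing_tags is a hypothetical helper, not part of BeautifulSoup, and the sample string is just the tail of the output shown above.

```python
import re

def count_closing_tags(text, tag):
    """Count occurrences of </tag> in text, ignoring case."""
    pattern = r'</\s*' + re.escape(tag) + r'\s*>'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Tail of the prettify() output quoted above.
tail = """<script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
</script>
</head>
</html>"""

html_closes = count_closing_tags(tail, 'html')      # more than 1 is suspicious
script_closes = count_closing_tags(tail, 'script')
```

In practice you would pass the full soup.prettify() string instead of tail; a well-formed page should have exactly one </html>.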
If you use BeautifulSoup 4 (the bs4 package) instead:
from bs4 import BeautifulSoup as soup4
import urllib2, re
html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup4(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
You get 'The Lumineers' as output.
If you cannot switch to bs4, I suggest you write the html variable out to a file out.txt, change the script to read from in.txt, then copy the output file to the input file and cut away chunks of it until the lookup works:
from BeautifulSoup import BeautifulSoup as soup3
import re
html = open('in.txt').read()
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
My first guess was to remove the <head> ... </head> section, and that worked wonders.
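The cut-away-chunks search can be partly automated. The sketch below removes one kind of section at a time and reports the first removal that makes a check pass. Both function names are hypothetical, and parses_ok is an assumed callback: in real use it would wrap the BeautifulSoup 3 lookup (for example, returning whether soup.find("div") is not None), while the usage here substitutes a plain string test.

```python
import re

def remove_sections(html, tag):
    """Remove every <tag ...>...</tag> region (non-greedy, case-insensitive)."""
    pattern = r'<%s\b.*?</%s\s*>' % (tag, tag)
    return re.sub(pattern, '', html, flags=re.IGNORECASE | re.DOTALL)

def find_culprit(html, parses_ok, candidates=('script', 'style', 'head')):
    """Return the first tag whose removal makes parses_ok(html) true, else None."""
    for tag in candidates:
        if parses_ok(remove_sections(html, tag)):
            return tag
    return None

# Stand-in page and check, only to show the call shape.
page = '<html><head><script>bad()</script></head><body><div>x</div></body></html>'
culprit = find_culprit(page, lambda h: 'bad()' not in h)
```

Note that the non-greedy match stops at the first closing tag it sees, so a literal closing tag inside a JavaScript string would end the removal early; treat the result as a hint about where to cut by hand.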
Once you know the head section is the culprit, you can strip it programmatically:
from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re
htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
head_end = htmlorg.index('>', head_end)
html = htmlorg[:head_start] + htmlorg[head_end+1:]
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
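The slicing above raises ValueError when the page has no head section. A slightly more defensive variant is sketched here as a standalone function; strip_head is a hypothetical name, and case-insensitive tag matching is an assumption that the original snippet does not make.

```python
import re

HEAD_OPEN = re.compile(r'<head\b', re.IGNORECASE)      # does not match <header>
HEAD_CLOSE = re.compile(r'</head\s*>', re.IGNORECASE)  # does not match </header>

def strip_head(html):
    """Drop everything from the first <head to the last </head>.

    Returns html unchanged if no complete head section is found.
    """
    opening = HEAD_OPEN.search(html)
    closings = list(HEAD_CLOSE.finditer(html))
    if opening is None or not closings:
        return html
    last_close = closings[-1]
    if last_close.start() < opening.start():
        return html
    return html[:opening.start()] + html[last_close.end():]

# Small stand-in page: the head contains a script with a literal "</head>".
page = ('<html><head><script>var x = "</head>";</script></head>'
        '<body><header><h1>Ho Hey</h1></header></body></html>')
cleaned = strip_head(page)
```

Like the original, this cuts up to the last </head>, so a closing tag quoted inside a script does not end the removal early. The cleaned string can then be passed to soup3(...) as before.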
Upvotes: 1