Reputation: 4441
I'm using the Beautiful Soup library to retrieve the link (the href) from the HTML content below:
<div class="store">
<label>Store</label>
<span>
<a title="Open in Google Play" href="https://play.google.com/store/apps/details?id=com.opera.mini.android" target="_blank">
<!-- ><span class="ui-icon app-store-gp"></span> -->
Google Play
</a><i class="icon-external-link"></i>
</span>
</div>
I used the following Python code to retrieve it:
pageFile = urllib.urlopen("appannie.com/apps/google-play/app/com.opera.mini.android")
pageHtml = pageFile.read()
pageFile.close()
print pageHtml
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("a", {"title":"Open in Google Play"})
print item
I get NoneType as the output. Any help would be really great.
I printed out the html page and the output was as follows:
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>
It works fine in the browser.
Upvotes: 0
Views: 353
Reputation:
item = soup.find("a", {"title":"Open in Google Play"})
You were initially searching for a "span" with a title "Open in Google Play", however the element that you're looking for is an "a" (a link).
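To make the point about matching the "a" element concrete, here is a minimal sketch that runs offline against the markup from the question. It assumes Python 3 and uses the standard library's html.parser as a stand-in for Beautiful Soup, so the class name LinkByTitle and the SAMPLE_HTML constant are hypothetical and not part of the original code.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (the HTML comment is left out).
SAMPLE_HTML = """
<div class="store">
<label>Store</label>
<span>
<a title="Open in Google Play" href="https://play.google.com/store/apps/details?id=com.opera.mini.android" target="_blank">
Google Play
</a><i class="icon-external-link"></i>
</span>
</div>
"""


class LinkByTitle(HTMLParser):
    """Hypothetical helper: collects the href of every <a> with a given title."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # The title lives on the <a> element, not on the surrounding <span>.
        if tag == "a" and attributes.get("title") == self.title:
            self.hrefs.append(attributes.get("href"))


parser = LinkByTitle("Open in Google Play")
parser.feed(SAMPLE_HTML)
print(parser.hrefs)
```

With Beautiful Soup itself, the same value should be available as item["href"] on the tag returned by soup.find, provided the find call actually matched something.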
Edit: since the server appears to return a 503 error, try sending a common User-Agent header with the code below (not tested, so it may not work at all; you'll need to import urllib2):
# sampleURL is the URL of the page to fetch
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"}
request = urllib2.Request(sampleURL, None, headers)
soup = BeautifulSoup(urllib2.urlopen(request).read())
item = soup.find("a", {"title":"Open in Google Play"})
print item
I also removed the unnecessary "".join(pageHtml): read() already returns a string, so there is no need for join.
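For readers on Python 3, where urllib2 has become urllib.request, the User-Agent idea can be sketched as follows. This is an untested-against-the-site sketch, not a guaranteed fix: the function names build_request and fetch_html and the timeout value are assumptions made for illustration, and the server may still refuse the request.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the answer above.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for statuses such as 503, so the error
        # page is never handed to the HTML parser by mistake.
        print("Server returned HTTP %d" % error.code)
        return None
```

The bytes returned by fetch_html can be passed straight to BeautifulSoup. Checking for None first avoids calling find on an error page, which is what produced the None result in the question.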
Upvotes: 3