Python HTML parsing using the BeautifulSoup framework

Question

I'm using the Beautiful Soup framework to retrieve the link (the href) from the HTML content below:

         
               Store
                 
                   
                        
                        Google Play

I used the following code to retrieve this in python:

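 import urllib
 from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; for version 3 use "from BeautifulSoup import BeautifulSoup"

 # urlopen needs a full URL including the scheme
 pageFile = urllib.urlopen("http://appannie.com/apps/google-play/app/com.opera.mini.android")
 pageHtml = pageFile.read()
 pageFile.close()
 print pageHtml
 soup = BeautifulSoup("".join(pageHtml))
 item = soup.find("a", {"title":"Open in Google Play"})

 print item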

I get None as the output. Any help would be appreciated.

I printed out the html page and the output was as follows:

  
  503 Service Temporarily Unavailable
  
  503 Service Temporarily Unavailable
  
nginx

The same URL works fine in the browser.

user2629998 · Accepted Answer

item = soup.find("a", {"title":"Open in Google Play"})

You were initially searching for a "span" with the title "Open in Google Play"; however, the element you're looking for is an "a" (a link).
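To make the lookup concrete, here is a minimal sketch of finding the link by its title and reading its href. It assumes BeautifulSoup 4 (`bs4`) on Python 3; the markup in `SAMPLE_HTML`, the example.com URL, and the helper name `find_store_link` are made up for illustration and are not the real page's content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page's HTML is not reproduced here.
SAMPLE_HTML = """
<div class="store">
  <span>Store</span>
  <a href="https://example.com/store/app" title="Open in Google Play">Google Play</a>
</div>
"""


def find_store_link(html):
    """Return the href of the <a title="Open in Google Play"> element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("a", {"title": "Open in Google Play"})
    if item is None:
        # find() returns None when nothing matches, e.g. when the server
        # sent an error page instead of the expected content.
        return None
    return item.get("href")


print(find_store_link(SAMPLE_HTML))
print(find_store_link("<html><body>503 Service Temporarily Unavailable</body></html>"))
```

Checking for None before reading the attribute helps tell "wrong selector" apart from "the page that came back is not the one expected".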

Edit: since the server appears to return a 503 error, try setting a common User-Agent with this code (untested, so it may not work):

import urllib2
# sampleURL stands for the URL of the page from the question
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"}
soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(sampleURL, None, headers)).read())
item = soup.find("a", {"title":"Open in Google Play"})
print item

I also removed the unnecessary "".join(pageHtml): read() already returns a string, so there is no need to join.
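For anyone on Python 3, where urllib2 was merged into urllib.request, the same idea might look roughly like the sketch below. The helper names `build_request` and `fetch_html` are hypothetical, the timeout and UTF-8 decoding are assumptions, and there is no guarantee the server will accept this User-Agent.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the answer above.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the page body as text, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            # Assumes the page is UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 503 like the one in the question ends up here.
        print("Server returned HTTP", err.code)
        return None
```

The text returned by fetch_html can then be passed to BeautifulSoup as before; if it is None, the request failed and there is nothing to parse.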
