Python, parsing html

Question

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module, so that my script will work with minimal overhead. Today I've been experimenting with parsing modules and came across BeautifulSoup. This is all great, but I don't understand it.

For educational purposes, I'd like to strip the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web-crawler, I'm not trying to crawl the whole site - just extract the information from this page to learn how parsing modules work!)

Movie Title, Quality, Torrent Link

There are 22 of these items, and I want each one stored in its own list, in order, i.e. item_1, item_2, and so on. Each list needs to contain these three values. For instance:

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

And then, to keep matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the information needs to be strictly ordered. This is all good, but all I'm getting is either the entire source contained in each list item, or empty items! An example item div is as follows:


    
        James Bond: Casino Royale (2006)
        Size: 1018.26 MB
        Quality: 720p
        Genre: Action | Crime
        IMDB Rating: 7.9/10
            
                Peers: 698
                Seeds: 356
            
    
    
        View Info
        Download

Any ideas? Would someone please do me the honour of giving me an example of how to do this? I'm not sure BeautifulSoup accommodates all of my requirements! PS. Sorry for the poor English, it's not my first language.

root · Accepted Answer

from bs4 import BeautifulSoup
import urllib2  # Python 2 standard library

# Download the page and hand the raw HTML to BeautifulSoup
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

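Or, to get exactly the output you wanted, take only the first text node that is a direct child of the <b> tag's parent (this skips the "Quality:" label itself) and strip the surrounding whitespace: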

In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
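
If depending on a non-standard module is a concern (as the question mentions), the same extraction can also be sketched with only the standard library's html.parser, in Python 3. This is a rough sketch rather than the approach above: the sample markup is an assumption reconstructed from the selectors used in this answer (a div with class browse-info, the first a for the title, a b label for the quality, and an a with class torrentDwl for the link), and MovieParser, extract_items and SAMPLE_HTML are hypothetical names. The real page may be structured differently.

```python
from html.parser import HTMLParser

# Assumed markup: only the classes and tag names come from the answer above;
# the wrapping <p> tags and the "#" hrefs are placeholders.
SAMPLE_HTML = """
<div class="browse-info">
  <a href="#">James Bond: Casino Royale (2006)</a>
  <p><b>Size:</b> 1018.26 MB</p>
  <p><b>Quality:</b> 720p</p>
  <a href="#" class="std-btn-small mleft">View Info</a>
  <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download</a>
</div>
<div class="browse-info">
  <a href="#">Pitch Perfect (2012)</a>
  <p><b>Size:</b> 700 MB</p>
  <p><b>Quality:</b> 720p</p>
  <a href="#" class="std-btn-small mleft">View Info</a>
  <a href="http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent" class="std-btn-small mleft torrentDwl">Download</a>
</div>
"""


class MovieParser(HTMLParser):
    """Collects [title, quality, link] for every div.browse-info, in page order."""

    def __init__(self):
        super().__init__()
        self.items = []            # finished items
        self.current = None        # item being filled, or None outside a block
        self.depth = 0             # <div> nesting depth inside the current block
        self.in_title = False      # inside the first <a> of the block
        self.in_bold = False       # inside a <b> label
        self.want_quality = False  # the last <b> label was "Quality:"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self.current is not None:
                self.depth += 1
            elif "browse-info" in classes:
                self.current = [None, None, None]
                self.depth = 1
        elif self.current is None:
            return
        elif tag == "a":
            if "torrentDwl" in classes:
                self.current[2] = attrs.get("href")
            elif self.current[0] is None:
                # The first plain link in a block is treated as the title.
                self.current[0] = ""
                self.in_title = True
        elif tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == "a" and self.in_title:
            self.current[0] = self.current[0].strip()
            self.in_title = False
        elif tag == "b":
            self.in_bold = False
        elif tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # The block is closed, so the item is complete.
                self.items.append(self.current)
                self.current = None

    def handle_data(self, data):
        if self.current is None:
            return
        text = data.strip()
        if self.in_title:
            self.current[0] += data
        elif self.in_bold:
            self.want_quality = (text == "Quality:")
        elif self.want_quality and text:
            # First non-blank text after the "Quality:" label.
            self.current[1] = text
            self.want_quality = False


def extract_items(html):
    """Return a list of [title, quality, link] lists found in html."""
    parser = MovieParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

The returned list keeps page order, so its first element corresponds to item_1 in the question, the second to item_2, and so on. To run it against the live page, the HTML could be fetched with urllib.request.urlopen and decoded before being passed to extract_items.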
