Reputation: 865
Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module so that my script will work with minimal overhead. Today I've been experimenting with parsing modules and came across BeautifulSoup. It looks great, but I don't understand it.
For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just extract the information from this page to learn how parsing modules work!):
Movie Title, Quality, Torrent Link
There are 22 of these items, and I want them stored in lists in order, i.e. item_1, item_2, and so on. Each list needs to contain these three values. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to keep matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the information needs to be strictly ordered. That would all be fine, but all I'm getting is either the entire page source contained in each list item, or empty items! An example item div is as follows:
<div class="browse-info">
<span class="info">
<h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
<p><b>Genre:</b> Action | Crime</p>
<p><b>IMDB Rating:</b> 7.9/10</p>
<span>
<p class="peers"><b>Peers:</b> 698</p>
<p class="peers"><b>Seeds:</b> 356</p>
</span>
</span>
<span class="links">
<a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
<a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
</span>
</div>
Any ideas? Would someone please do me the honour of giving me an example of how to do this? I'm not sure BeautifulSoup accommodates all of my requirements! PS. Sorry for the poor English; it's not my first language.
Upvotes: 0
Views: 2188
Reputation: 80436
from bs4 import BeautifulSoup
import urllib2
f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)
In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...
Or, to get exactly the output you wanted (the quality without the "Quality:" label):
In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
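For anyone who would rather avoid a third-party dependency (the question mentions wanting minimal overhead), the same extraction can be sketched with the standard library's html.parser. This is a different technique from the BeautifulSoup loops above, and it is written for Python 3. The class name BrowseInfoParser and the SAMPLE string (a trimmed copy of the markup from the question) are illustrative assumptions, and the sketch assumes that no other div is nested inside a browse-info div.

```python
from html.parser import HTMLParser

# Trimmed copy of the example markup from the question (illustrative input).
SAMPLE = """
<div class="browse-info">
<span class="info">
<h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
</span>
<span class="links">
<a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
<a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
</span>
</div>
"""


class BrowseInfoParser(HTMLParser):
    """Collects [title, quality, torrent_link] for each browse-info div.

    Hypothetical helper; assumes no <div> is nested inside a browse-info div.
    """

    def __init__(self):
        super().__init__()
        self.items = []        # finished [title, quality, link] lists
        self.current = None    # item being filled, or None outside a div
        self.in_h3 = False
        self.in_title = False  # inside the <a> within <h3>
        self.in_bold = False   # inside a <b> label such as "Quality:"
        self.label = ""        # text of the most recent <b> label

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "browse-info" in classes:
            self.current = ["", "", ""]
        elif self.current is None:
            return  # ignore everything outside a browse-info div
        elif tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_title = True
        elif tag == "a" and "torrentDwl" in classes:
            self.current[2] = attrs.get("href", "")
        elif tag == "b":
            self.in_bold = True
            self.label = ""

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_title = False
        elif tag == "b":
            self.in_bold = False
        elif tag == "p":
            self.label = ""  # a label only applies within its own <p>
        elif tag == "div":
            self.items.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is None:
            return
        if self.in_title:
            self.current[0] += data
        elif self.in_bold:
            self.label += data
        elif self.label == "Quality:" and data.strip():
            self.current[1] = data.strip()


parser = BrowseInfoParser()
parser.feed(SAMPLE)
parser.close()
for item in parser.items:
    print(item)
```

On the live page, the downloaded HTML (decoded to text) would be fed to the parser in place of SAMPLE. Each entry of parser.items is then one [title, quality, link] list, in page order.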
Upvotes: 2
Reputation: 7343
As requested, here is a simple example of a parser. It uses lxml. With lxml there are two ways to work with the DOM tree: XPath and CSS selectors. I prefer XPath.
import lxml.html
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    # The container div that holds every category and its rows.
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    category = None  # set once the first category heading is seen
    for el in main_div.getchildren():
        # A child with an <a name="...tn..."> anchor starts a new category.
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            # Otherwise collect the child elements whose markup contains a dash.
            for element in el.getchildren():
                if '—' in lxml.html.tostring(element):
                    tr.append(element)
            print category, tr

parse()
Upvotes: 0