Reputation: 37
I want to scrape data from this page. Here is my current code:
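import cStringIO
import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

buf = cStringIO.StringIO()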
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()
response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()
It works, but I need title, video link and description as separate variables. How can I achieve this?
Upvotes: 1
Views: 949
Reputation: 473803
The title can be extracted using //title/text(), and the video source link via //video/source/@src:
import re

selector = Selector(response=response)
title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()  # list of text fragments
video_sources = selector.xpath('//video/source/@src').extract()[0]  # first <source> only
# the video code is embedded in the file name of the EdImage meta tag
code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)
print title
print description
print video_sources
print code
Prints:
Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
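The regular-expression step is the one most likely to break if the image URL changes shape: re.search returns None when nothing matches, and calling .group(1) on that raises AttributeError. Below is a minimal sketch of a guarded version. The helper name extract_code and the example URL are hypothetical (only the -play-small.jpg suffix and the resulting code are known from the output above), and it is written so that it also runs on Python 3.

```python
import re

# Pattern from the answer above, with the dot escaped so it matches a literal "."
CODE_RE = re.compile(r'(\w+)-play-small\.jpg$')


def extract_code(image_url):
    """Return the video code embedded in the image URL, or None if absent."""
    match = CODE_RE.search(image_url)
    return match.group(1) if match else None


# Hypothetical URL: only the file-name suffix is known from the page.
example_url = 'http://example.invalid/images/L49VXZwfup8-play-small.jpg'
code = extract_code(example_url)
print(code)
```

Returning None instead of raising lets the caller decide whether a missing code is an error or just a page without that meta tag.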
Upvotes: 1
Reputation: 881547
There's no need for scrapy for a single-URL fetch -- just get that single page's HTML with a simpler tool (even the simplest, urllib.urlopen(theurl).read()!), then analyze the HTML, e.g. with BeautifulSoup. From a simple "view source" it looks like you're looking for:
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
(the title), one of the three:
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>
(the video linkS, plural, and I can't pick one because you don't tell us which format[s] you prefer!-), and
<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />
(the description). BeautifulSoup makes it pretty trivial to get each one, e.g. after the needed imports:
html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text
and so on (but you'll have to pick one video link -- and I see that in the page source they're named "pre-roll", so it may be that the links to the actual, non-ad videos are not on the page at all but only accessible after a log-in or similar).
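To get the three values into separate variables as the question asks, the remaining lookups follow the same pattern as the title. The sketch below is a stand-in under stated assumptions: it uses Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without third-party packages, and it parses a small inline sample built from the tags quoted above instead of fetching the page. The class name PageInfoParser and the mp4-first preference are illustrative choices, not something the page or the question specifies.

```python
from html.parser import HTMLParser

# Inline sample assembled from the tags quoted in this answer (not the full page).
SAMPLE_HTML = """
<html><head>
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />
</head><body>
<video>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>
</video>
</body></html>
"""


class PageInfoParser(HTMLParser):
    """Collects the <title> text, the meta description and all <source> links."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.description = None
        self.sources = {}        # maps MIME type -> URL, in page order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('name') == 'description':
            self.description = attrs.get('content')
        elif tag == 'source' and 'src' in attrs:
            self.sources[attrs.get('type')] = attrs['src']

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


parser = PageInfoParser()
parser.feed(SAMPLE_HTML)

title = parser.title
description = parser.description
# Prefer mp4 (an arbitrary choice), else fall back to the first listed format.
video_link = parser.sources.get('video/mp4') or next(iter(parser.sources.values()), None)
```

With BeautifulSoup the same three lookups would be find calls on the tag names and attributes used here. Keeping every source in a dict keyed by MIME type defers the format decision to a single line.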
Upvotes: 1