Reputation: 1
I have an item, item['link'], of this form:
item['link'] = site.select('div[2]/div/h3/a/@href').extract()
The links it extracts are of this form:
'link': [u'/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],
I want them to be this way:
'link': [u'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],
Is it possible to do this directly in Scrapy, instead of re-editing the list afterwards?
Upvotes: 0
Views: 3768
Reputation: 9
Use response.urljoin()
There is no method that extracts an absolute URL directly. You have to call response.urljoin() on each extracted link and pass the result to a Request whose callback is a second parse function. In that second parse function you can extract whatever you wish.
Upvotes: 1
Reputation: 12410
Yeah, every time I'm grabbing a link I have to use the method urlparse.urljoin.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # only grab URLs with "content" in the URL name
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)
I imagine you're trying to grab the entire URL to parse it, right? If that's the case, a simple two-method system would work on a BaseSpider. The parse method finds the link and sends it to the parse_url method, which outputs what you're extracting to the pipeline.
import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # only grab URLs with "content" in the URL name
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        # i[1:] drops the href's first character before joining it to the page URL
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # this grabs the text of every div whose class contains "odd"
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
Upvotes: 2
Reputation: 59594
No, Scrapy doesn't do this for you. According to the standard, URLs in HTML may be absolute or relative. Scrapy sees your extracted URLs just as data; it cannot know that they are URLs, so you must join relative URLs with the base URL manually.
You need urlparse.urljoin:
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
>>> import urlparse
>>> urlparse.urljoin('http://www.youtube.com', '/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189')
'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'
>>>
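On Python 3 the same function lives in urllib.parse instead of urlparse. The following is a minimal sketch of applying it to a whole list of extracted hrefs, assuming the base URL is http://www.youtube.com; the to_absolute helper and the BASE_URL constant are hypothetical names used for illustration, not part of Scrapy.

```python
from urllib.parse import urljoin

# Assumed base URL; inside a spider this would typically be response.url.
BASE_URL = 'http://www.youtube.com'

def to_absolute(hrefs, base=BASE_URL):
    """Return a new list with every href resolved against base."""
    return [urljoin(base, href) for href in hrefs]

links = to_absolute(['/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'])
print(links)
```

The result stays a list, so it can be assigned to item['link'] in place of the raw extract() output.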
Upvotes: 1
Reputation: 4085
If you really need the link as a list, this would work for you:
item['link'] = ['http://www.youtube.com%s'%a for a in site.select('div[2]/div/h3/a/@href').extract()]
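One caveat with plain string formatting: it only works while every href is root-relative. The sketch below compares it with urljoin (shown here with the Python 3 import) on a made-up list of hrefs; the example URLs are hypothetical and only illustrate the difference.

```python
from urllib.parse import urljoin

base = 'http://www.youtube.com'
# Hypothetical hrefs: one root-relative, one already absolute.
hrefs = ['/watch?v=abc', 'http://example.com/other']

# Prefixing the host blindly mangles the href that is already absolute.
concatenated = ['http://www.youtube.com%s' % h for h in hrefs]

# urljoin leaves absolute hrefs alone and resolves relative ones against base.
joined = [urljoin(base, h) for h in hrefs]

print(concatenated)
print(joined)
```

If the page is known to contain only root-relative links, as in the question, both approaches give the same result.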
Upvotes: 1