Reputation: 1

Scrapy Modify Link to include Domain Name

I have an item, item['link'], of this form:

item['link'] = site.select('div[2]/div/h3/a/@href').extract()

The links it extracts are of this form:

'link': [u'/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

I want them to be this way:

'link': [u'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

Is it possible to do this directly in Scrapy, instead of re-editing the list afterwards?

Upvotes: 0

Answers (4)

Kriti Rohilla

Reputation: 9

Use response.urljoin(). There is no method that extracts an absolute URL directly. You have to call response.urljoin() on each extracted link, then create a second parse function that is invoked through the request's callback. In this second parse function you can extract whatever you wish.
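As a rough sketch of what that looks like for the list in the question: the StubResponse class below is a hypothetical stand-in so the snippet runs without Scrapy installed, and the playlist page URL is an assumption. In a real spider you would call response.urljoin() on the response passed to your callback.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin


class StubResponse:
    """Hypothetical stand-in for a Scrapy response: only .url and .urljoin()."""

    def __init__(self, url):
        self.url = url

    def urljoin(self, href):
        # Resolve href against the URL of the page it was found on.
        return urljoin(self.url, href)


def absolute_links(response, hrefs):
    """Return every extracted href as an absolute URL."""
    return [response.urljoin(href) for href in hrefs]


# Assumed page URL; the href is the one from the question.
response = StubResponse('http://www.youtube.com/playlist?list=SP3DB54B154E6D121D')
hrefs = ['/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189']
links = absolute_links(response, hrefs)
print(links)
```

The result can be assigned straight to item['link'], so the list never needs a second pass.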

Upvotes: 1

Chris Hawkes

Reputation: 12410

Yeah, every time I grab a link I have to use urlparse.urljoin.

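import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  # only hrefs containing "content"
    for url in urls:
        # Pass the href unchanged; urljoin resolves it against the page URL.
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_url)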

I imagine you're trying to grab the entire URL to parse it, right? If that's the case, a simple two-method system would work on a BaseSpider. The parse method finds the link and sends it to the parse_url method, which outputs what you're extracting to the pipeline.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  # only hrefs containing "content"
    for url in urls:
        # Follow each link; parse_url receives the downloaded page.
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # Grab the text of every div whose class contains "odd".
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
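One detail worth checking when joining: stripping a leading slash from an href before the join changes where it resolves. A small sketch using the Python 3 import path; the example.com page URL and hrefs are made up for illustration.

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

# Hypothetical page the hrefs were scraped from.
page_url = 'http://example.com/articles/2013/index.html'

# With the leading slash the href resolves from the site root.
root_relative = urljoin(page_url, '/content/zip.html')

# Without it the href resolves relative to the page's directory.
path_relative = urljoin(page_url, 'content/zip.html')

print(root_relative)  # http://example.com/content/zip.html
print(path_relative)  # http://example.com/articles/2013/content/zip.html
```

So unless the site really expects the second form, it is usually safer to hand urljoin the href exactly as extracted.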

Upvotes: 2

warvariuc

Reputation: 59594

No, Scrapy doesn't do this for you. According to the standard, URLs in HTML may be absolute or relative. Scrapy treats the URLs you extracted as plain data; it cannot know that they are URLs, so you must join relative URLs with the base URL manually.

You need urlparse.urljoin:

Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
>>> import urlparse
>>> urlparse.urljoin('http://www.youtube.com', '/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189')
'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'
>>>
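The session above is Python 2. On Python 3 the same function lives in urllib.parse, so a snippet that should run on either could import it along these lines (a sketch, not the only way to do it):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = 'http://www.youtube.com'
href = '/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'

# Same call as in the interactive session, just with the portable import.
full_url = urljoin(base, href)
print(full_url)
```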

Upvotes: 1

akhter wahab

Reputation: 4085

If you really need link as a list, this would work for you:

item['link'] = ['http://www.youtube.com%s'%a for a in site.select('div[2]/div/h3/a/@href').extract()]
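Prefixing the domain works as long as every href starts with a slash. If the page might also contain absolute links, joining is a bit more forgiving. A quick comparison, where the second href is a made-up example and the import is the Python 3 location of urlparse.urljoin:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

BASE = 'http://www.youtube.com'
hrefs = [
    '/watch?v=1PTw-uy6LA0',          # root-relative, as in the question
    'http://www.example.com/other',  # already absolute (hypothetical)
]

# String formatting glues the domain onto everything, absolute or not.
prefixed = ['http://www.youtube.com%s' % h for h in hrefs]

# urljoin leaves absolute URLs alone and only resolves the relative ones.
joined = [urljoin(BASE, h) for h in hrefs]

print(prefixed[1])  # http://www.youtube.comhttp://www.example.com/other
print(joined[1])    # http://www.example.com/other
```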

Upvotes: 1
