Alex Legg

Reputation: 313

Scrapy - Sending a new Request/using callback

Delving deeper than the use of basic scraping functions.

I understand the basic BaseSpider class: name, allowed_domains, and that a Request object is sent for each start_url, with the parse function used as the callback; parse then receives the Response.

I know my parse function collects an XPath selection for every element whose class contains "service-name". I believe it then loops through those matches, storing each one's data in the object "item", which is an instance of the TgmItem class from my items.py container.

'newUrl' contains the concatenated URL that needs to be scraped next. I need to figure out how to get the LinkParse function to scrape each newUrl found, or to receive all the links so it can scrape them individually.

I know meta is used to pass my item data along with the Request, and callback gives the Request a function to send the response to.

LinkParse will be used to scrape more data from within each of the scraped links, e.g.: item['test'] = link.xpath('test()').extract()

def parse(self, response):
    links = response.selector.xpath('//*[contains(@class, "service-name")]')
    for link in links:
        item = TgmItem()
        item['name'] = link.xpath('text()').extract()
        item['link'] = link.xpath('@href').extract()
        item['newUrl'] = response.url.join(item['link'])
        yield Request(newUrl, meta={'item':item}, callback=self.LinkParse)

def LinkParse(self, response):
    links = response.selector.xpath('*')
    for link in links:
        item = response.request.meta['item']
        item['test'] = link.xpath('text()').extract()
        yield item

I know the callback function receives a response (web page), which in my case needs to be each of the links found. I think that to solve this, I have to send a Request for each new URL and deal with each/all link(s) in the LinkParse function.

I'm getting an error saying newUrl is not defined; I'm guessing the Request can't accept it like that.

I'm not expecting a complete solution here; could someone please point me in the right direction, or toward something to research further?

Upvotes: 3

Views: 4373

Answers (1)

alecxe

Reputation: 473763

The newUrl variable is not defined. Use item['newUrl'] instead:

yield Request(item['newUrl'], meta={'item': item}, callback=self.LinkParse)

Also, the response.url.join() call doesn't make sense: response.url is a string, and str.join() concatenates an iterable of strings rather than resolving URLs. If you want to combine response.url with an href attribute value, use urljoin():

item['newUrl'] = urlparse.urljoin(response.url, item['link'])
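For example, urljoin() resolves a relative href against the page URL properly (a minimal sketch; the URLs here are made up, and on Python 3 the function lives in urllib.parse rather than the urlparse module):

```python
# Python 3 spelling; on Python 2 it would be: from urlparse import urljoin
from urllib.parse import urljoin

base = "http://example.com/services/index.html"
href = "/services/detail?id=1"

# urljoin resolves the relative href against the base URL
print(urljoin(base, href))  # http://example.com/services/detail?id=1
```

Recent Scrapy versions also expose this as response.urljoin(href), which uses the response's own URL as the base.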

Besides, I'm not sure what you are trying to do in the LinkParse callback. As far as I understand, you want to follow the "service-name" links and get additional data for each one; in that case, I don't see why you need the for link in links loop in the LinkParse() method.

As far as I understand, your LinkParse() method should look like this:

def LinkParse(self, response):
    newfield = response.selector.xpath('//myfield/text()').extract()
    item = response.meta['item']
    item['newfield'] = newfield  
    return item
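One more subtlety worth noting: .extract() returns a list of strings, so item['link'] in the question's parse() is a list, while urljoin() expects a single string. A minimal, Scrapy-free sketch of normalizing that before joining (the first() helper and the example values are my own, not part of Scrapy):

```python
from urllib.parse import urljoin

def first(values, default=""):
    """Return the first extracted value, since .extract() yields a list."""
    return values[0] if values else default

base = "http://example.com/services/"
extracted = ["/services/widget-cleaning"]  # shape of link.xpath('@href').extract()

print(urljoin(base, first(extracted)))  # http://example.com/services/widget-cleaning
```

Scrapy selectors also offer extract_first() for the same purpose, which avoids the indexing entirely.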

Upvotes: 2
