Evan Donovan

Reputation: 766

Scraping child pages and concatenating the results in Scrapy

I have the following spider for Scrapy. I need to scrape not only the top level pages in my sitemap but also the pages that are 1st-level children of those pages. Then I need to concatenate the results of the children's scrape with the body item from my parent parse method. Could anyone help me with the code to do something like this?

from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import HtmlXPathSelector
from cvorgs.items import CvorgSite

class CvorgSpider(SitemapSpider):
    name = 'cvorg_spider'
    sitemap_urls = ["http://www.urbanministry.org/cvorg_urls.xml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = CvorgSite()
        item['url'] = response.url
        item['title'] = hxs.select('//title/text()').extract()
        item['meta'] = hxs.select('/html/head/meta[@name="description"]/@content').extract()
        body = ' '.join(hxs.select('//body//p//text()').extract())
        item['body'] = body.replace('"', '\'')
        return item

Upvotes: 0

Views: 1860

Answers (1)

Tushar Gupta

Reputation: 15923

OK, so you need to scrape a URL from one page and then scrape the page it points to. For that, yield a Request from your parse method. In the example below I extract a relative link, turn it into a full URL, and request it. callback=self.parse_category_tilte names the method that will receive the response for the URL returned by complete_url(link):

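# At the top of the spider module:
from scrapy.http import Request

# Inside the parse method, after hxs = HtmlXPathSelector(response):
sites1 = hxs.select('//div[@class="left-column"]/div[@class="resultContainer"]/span/h2/a/@href')
for sit in sites1:
    link = sit.extract()  # relative href taken from the page
    yield Request(complete_url(link), callback=self.parse_category_tilte)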

complete_url turns the relative link into a full URL:

def complete_url(string):
    """Return complete url"""
    return "http://www.timeoutdelhi.net" + string
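
The string concatenation in complete_url only works when every link is root-relative. As a possible alternative, here is a minimal sketch that resolves links with the standard library's urljoin instead. It assumes Python 3 (on Python 2 the function lives in the urlparse module), and BASE_URL and the sample paths are placeholders for illustration, not values from a real crawl:

```python
from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

# Site root taken from the answer's complete_url; the name BASE_URL is an assumption.
BASE_URL = "http://www.timeoutdelhi.net"


def complete_url(link, base=BASE_URL):
    """Return an absolute URL for link, resolved against base."""
    return urljoin(base, link)


# Hypothetical sample links, only to show the behavior.
print(complete_url("/some/path"))                          # root-relative link
print(complete_url("http://www.urbanministry.org/page"))   # already absolute, left unchanged
```

Compared with plain concatenation, urljoin leaves links that are already absolute untouched, so the same helper can be used on pages that mix both kinds of href.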

Then scrape the child page in the parse_category_tilte method:

def parse_category_tilte(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="box-header"]/h3/text()')
    items = []
    for site in sites:
        item = OnthegoItem()
        item['ename'] = site.extract()
        items.append(item)
    return items
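
The question also asks how to concatenate the children's text with the parent's body. One common way to do this in Scrapy is to pass the partly filled item to the child callback through the request's meta dictionary (Request(url, callback=..., meta={'item': item}), read back as response.meta['item']) and append to it there. The merging step itself is plain Python. The sketch below shows only that step; the helper names clean_body and append_child_body are hypothetical, and a plain dict and hard-coded lists stand in for the item and for the extract() results:

```python
def clean_body(paragraphs):
    """Join text fragments into one string, replacing double quotes with
    single quotes in the same way as the question's parse() method."""
    return ' '.join(paragraphs).replace('"', "'")


def append_child_body(item, child_paragraphs):
    """Append a child page's text to item['body'] and return the item."""
    child_text = clean_body(child_paragraphs)
    if child_text:
        item['body'] = (item['body'] + ' ' + child_text).strip()
    return item


# Stand-in for the item built in the parent callback (a dict, not a real CvorgSite).
item = {'url': 'http://www.example.org/parent', 'body': clean_body(['Parent text.'])}

# Stand-ins for what extract() might return on two child pages.
for child_paragraphs in (['First child.'], ['Second "child".']):
    item = append_child_body(item, child_paragraphs)

print(item['body'])
```

In a spider, each child callback would call append_child_body on response.meta['item']. Because returning the item from every child callback would emit it once per child, one option is to keep the list of remaining child URLs in meta as well, request them one at a time, and return the item only after the last child has been processed.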

Hope this helps, and please upvote if it does. :)

Upvotes: 1
