Reputation: 766
I have the following spider for Scrapy. I need to scrape not only the top level pages in my sitemap but also the pages that are 1st-level children of those pages. Then I need to concatenate the results of the children's scrape with the body item from my parent parse method. Could anyone help me with the code to do something like this?
from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import HtmlXPathSelector
from cvorgs.items import CvorgSite

class CvorgSpider(SitemapSpider):
    name = 'cvorg_spider'
    sitemap_urls = ["http://www.urbanministry.org/cvorg_urls.xml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = CvorgSite()
        item['url'] = response.url
        item['title'] = hxs.select('//title/text()').extract()
        item['meta'] = hxs.select('/html/head/meta[@name="description"]/@content').extract()
        body = ' '.join(hxs.select('//body//p//text()').extract())
        item['body'] = body.replace('"', '\'')
        return item
Upvotes: 0
Views: 1860
Reputation: 15923
OK, so you need to scrape a value such as a URL from one page and then scrape the page it points to.
To do that, yield a Request from your parse method (yield is a Python keyword, not a function).
In the example here I extract a relative link and build a full URL from it.
callback=self.parse_category_tilte names the method that will receive the response
for the URL returned by complete_url(link):
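# Inside the spider's parse() method; needs: from scrapy.http import Request
sites1 = hxs.select('//div[@class="left-column"]/div[@class="resultContainer"]/span/h2/a/@href')
for sit in sites1:
    link = sit.extract()
    # Request the child page; its response is handled by parse_category_tilte
    yield Request(complete_url(link), callback=self.parse_category_tilte)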
complete_url turns the relative link into a full URL:
def complete_url(string):
    """Return complete url"""
    return "http://www.timeoutdelhi.net" + string
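If some of the extracted hrefs are already absolute, plain string concatenation would produce a broken URL. Here is a minimal sketch of an alternative, assuming Python 3's standard urllib.parse.urljoin (in Python 2 the same function lives in the urlparse module). The BASE_URL constant is a name I made up for illustration:

```python
from urllib.parse import urljoin

# Assumed base address, taken from the complete_url example; adjust for your site.
BASE_URL = "http://www.timeoutdelhi.net"


def complete_url(link):
    """Return an absolute URL for a link that may be relative or already absolute."""
    # urljoin resolves a relative link against the base and leaves an
    # absolute link untouched.
    return urljoin(BASE_URL, link)
```

It is called the same way as before, e.g. complete_url(link) inside the loop that yields the Request.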
Then scrape the child page in the parse_category_tilte callback:
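def parse_category_tilte(self, response):
    # Build a selector for the child page before querying it
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="box-header"]/h3/text()')
    items = []
    for site in sites:
        item = OnthegoItem()
        item['ename'] = site.extract()
        items.append(item)
    return items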
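For the concatenation part of the question, one possible approach is to carry the parent item along with each child request and only return it once every child page has been handled. Scrapy's Request accepts a meta dict that is available again as response.meta in the callback, so the bookkeeping object can travel that way. The sketch below shows only the bookkeeping in plain Python, with no Scrapy calls, so it is an illustration under assumptions and not a tested spider. PendingItem and its method names are hypothetical:

```python
class PendingItem(object):
    """Hypothetical helper: collects child-page text for one parent item."""

    def __init__(self, item, expected_children):
        self.item = item                    # dict-like parent item with a 'body' key
        self.remaining = expected_children  # child responses still outstanding

    def add_child(self, child_text):
        """Append one child's text; return the item once all children are in, else None."""
        self.item['body'] = (self.item['body'] + ' ' + child_text).strip()
        self.remaining -= 1
        return self.item if self.remaining == 0 else None


# Simulate a parent page with two child pages, without any network access.
pending = PendingItem({'url': 'http://example.com/parent', 'body': 'parent text'}, 2)
first = pending.add_child('child one')    # one child is still outstanding
second = pending.add_child('child two')   # last child arrived, item is complete
```

In a spider you would create the PendingItem in parse, yield Request(url, callback=..., meta={'pending': pending}) for each child link, and in the child callback return whatever pending.add_child(...) gives back when it is not None. A child request that fails never decrements the counter, so you would probably also want an errback on each Request that accounts for the missing child.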
Hope this helps and upvote.:)
Upvotes: 1