Reputation: 740
I am trying to download an HTML-only website using Scrapy, with the CrawlSpider class. My crawler downloads the HTML source of the pages and makes a local mirror of the website. It mirrors the website successfully, but without images. To download the images attached to each page, I tried adding the following to my parser:
def parse_link(self, response):
    # Download the source of the page
    # CODE HERE
    # Now search for images
    x = HtmlXPathSelector(response)
    imgs = x.select('//img/@src').extract()
    # Download images
    for i in imgs:
        r = Request(urljoin(response.url, i), callback=self.parse_link)
        # execute the request here
In the examples in Scrapy's documentation, the parser seems to return a Request object, which then gets executed.
Is there a way to execute a Request by hand, so as to get a Response? I need to execute multiple requests per parse_link call.
Upvotes: 3
Views: 3711
Reputation: 8212
You could download images with the Images pipeline (a settings sketch is at the end of this answer), or if you want to execute the Requests manually, use yield:
def parse_link(self, response):
    """Download the source of the page"""
    # CODE HERE
    item = my_loader.load_item()
    # Now search for images
    imgs = HtmlXPathSelector(response).select('//img/@src').extract()
    # Download images
    path = '/local/path/to/where/i/want/the/images/'
    item['path'] = path
    for image_src in imgs:
        item['images'].append(image_src)
        yield Request(urljoin(response.url, image_src),
                      callback=self.parse_images,
                      meta=dict(path=path))
    yield item

def parse_images(self, response):
    """Save images to disk"""
    path = response.meta.get('path')
    n = get_the_filename(response.url)
    with open(path + n, 'wb') as f:
        f.write(response.body)
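For the Images pipeline option mentioned above, Scrapy does the fetching and saving for you once the item carries an image_urls field. A minimal sketch, assuming a recent Scrapy (on the old releases that still ship HtmlXPathSelector the pipeline lives at scrapy.contrib.pipeline.images.ImagesPipeline) and a made-up PageItem class; the pipeline needs PIL/Pillow installed and writes the files under IMAGES_STORE:

# settings.py -- enable the built-in Images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/local/path/to/where/i/want/the/images/'

# items.py -- the pipeline looks for these two fields
import scrapy

class PageItem(scrapy.Item):
    image_urls = scrapy.Field()   # absolute URLs to download
    images = scrapy.Field()       # filled in by the pipeline after downloading

# in the spider -- just collect the URLs; the pipeline fetches and saves them
def parse_link(self, response):
    item = PageItem()
    item['image_urls'] = [response.urljoin(src)
                          for src in response.xpath('//img/@src').extract()]
    yield item

After the downloads finish, the pipeline fills item['images'] with the stored paths, so you don't have to write the files yourself.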
Upvotes: 2