Reputation: 1
start_urls = ['https://image.jpg']

def start_requests(self):
    for url in self.start_urls:
        request = scrapy.Request(url, callback=self.parse)
        yield request

def parse(self, response):
    item = GetImgsItem()
    # print(response.url)
    item['image_urls'] = response.url
    yield item
My spider can now download the image from start_urls, but the request is sent twice to fetch one image. How can I make it download the image directly in start_requests?
Question 2: I created two spiders (spider A and spider B) in my project. In spider A, I have a specific pipeline class to process the downloaded items, and it works well.
But when I later ran spider B, it also went through spider A's pipeline class. How can I configure the pipeline class so that only spider A uses it?
Upvotes: 0
Views: 190
Reputation: 299
For the first question, you could start with a single dummy request and then yield the image items from your parse method. This could save you from hacking around other middlewares.
start_urls = ['https://any.dummy.website']
image_urls = [...]

def parse(self, dummy_response):
    yield Item(image_urls=self.image_urls)
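For completeness, this is roughly what the full spider could look like; the spider name, the dummy URL, and the item import location are placeholders you would adapt to your project:

import scrapy
from ..items import GetImgsItem  # assumed location of your item class

class ImgSpider(scrapy.Spider):
    name = 'img_spider'  # placeholder name
    start_urls = ['https://any.dummy.website']  # one cheap bootstrap request
    image_urls = ['https://image.jpg']          # the real image URL(s)

    def parse(self, dummy_response):
        # The dummy response is discarded; ImagesPipeline downloads every
        # URL in the image_urls list, each exactly once.
        yield GetImgsItem(image_urls=self.image_urls)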
Upvotes: 0
Reputation: 33
To answer your second question, take a look at this post:
How can I use different pipelines for different spiders in a single Scrapy project
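In short, one approach described there is to check inside the pipeline which spider the item came from. A minimal sketch (class and spider names are placeholders):

class SpiderAPipeline:
    def process_item(self, item, spider):
        # Pass items from every other spider through untouched.
        if spider.name != 'spider_a':
            return item
        # ... processing that should only run for spider A ...
        return item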
You can also just delete the ITEM_PIPELINES entry in your settings.py file and define custom_settings in your spider instead.
class SpiderA(scrapy.Spider):
    name = 'spider_a'
    custom_settings = {
        'ITEM_PIPELINES': {
            'project.pipelines.MyPipeline': 300,
        }
    }
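Note that custom_settings is merged over the project-wide settings when the crawler starts, so MyPipeline here only runs for spider_a, while spider B keeps whatever ITEM_PIPELINES (if anything) remains in settings.py.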
But I think the example shown in the post above is a bit more elegant.
Upvotes: 1