Reputation: 298
There are two spiders which use the same resource file and almost the same structure.
spiderA contains:
import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderA"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string1" for url in urls]

    def parse(self, response):
        pass
spiderB contains:
import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderB"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string2" for url in urls]

    def parse(self, response):
        pass
How can I combine spiderA and spiderB, and add a switch variable so that scrapy crawl
runs a different spider depending on my needs?
Upvotes: 4
Views: 265
Reputation: 298
Referring to a bare spider_type inside the spider results in an error:
NameError: name 'spider_type' is not defined.
Inside the spider class it has to be self.spider_type.
import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()
        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass
To be strict and precise, pass the value as a quoted string when calling the spider:
scrapy crawl myspider -a spider_type='second'
Upvotes: 0
Reputation: 3717
Try adding a separate parameter for the spider type. You can set it by calling scrapy crawl myspider -a spider_type=second. Check this code example:
import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()
        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass
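One detail of the loop above that may be worth hardening: splitting on "\r\n" assumes Windows line endings, and a trailing newline in webs.txt would produce an empty URL. A possible helper, assuming one URL per line, is sketched below. build_urls is a hypothetical name, not part of Scrapy, and the example URLs are placeholders.

```python
def build_urls(raw: bytes, suffix: str) -> list[str]:
    """Decode the resource file and append `suffix` to every non-empty line."""
    # splitlines() handles \n, \r\n and \r line endings alike.
    lines = raw.decode().splitlines()
    # Skip blank lines so no request is made for an empty URL.
    return [line.strip() + suffix for line in lines if line.strip()]


urls = build_urls(b"http://a.example/\r\nhttp://b.example/\n\n", "string1")
```

Inside start_requests, the bytes returned by pkgutil.get_data could be passed straight to this helper, with one scrapy.Request yielded per resulting URL.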
You can also create a base class and inherit from it, overriding only the variable that is appended to the URL and the name (so each spider can be called separately).
Upvotes: 2