showkey

Reputation: 298

How can I combine the two spiders into just one?

There are two spiders which use the same resource file and almost the same structure.

spiderA contains:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderA"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string1"  for url in urls]

    def parse(self, response):
        pass

spiderB contains:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "spiderB"
    data = pkgutil.get_data("tutorial", "resources/webs.txt")
    data = data.decode()
    urls = data.split("\r\n")
    start_urls = [url + "string2"  for url in urls]

    def parse(self, response):
        pass

How can I combine spiderA and spiderB, and add a switch variable so that `scrapy crawl` runs the appropriate spider depending on my need?

Upvotes: 4

Views: 265

Answers (2)

showkey

Reputation: 298

Using a bare `spider_type` results in an error:

NameError: name 'spider_type' is not defined.

It must be referenced as `self.spider_type` inside the spider class.

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()

        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass

To select the spider type explicitly, pass it on the command line:

scrapy crawl myspider -a spider_type='second'
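For context on why `self.spider_type` works: arguments passed with `-a` are set as instance attributes on the spider. A minimal sketch of that mechanism (the `Spider` stub stands in for `scrapy.Spider` so the example is self-contained; `getattr` with a default is an alternative to the `hasattr` guard):

```python
class Spider:
    """Stand-in for scrapy.Spider, which sets -a arguments as attributes."""
    def __init__(self, **kwargs):
        # each -a key=value pair from the command line becomes an attribute
        for key, value in kwargs.items():
            setattr(self, key, value)

class StockSpider(Spider):
    name = "myspider"

    def url_suffix(self):
        # getattr supplies a default, so no hasattr guard is needed
        return getattr(self, "spider_type", "first")

spider = StockSpider(spider_type="second")
print(spider.url_suffix())  # second
```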

Upvotes: 0

vezunchik

Reputation: 3717

Try adding a separate parameter for the spider type. You can set it by calling `scrapy crawl myspider -a spider_type=second`. Check this code example:

import scrapy
import pkgutil

class StockSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        if not hasattr(self, 'spider_type'):
            self.logger.error('No spider_type specified')
            return
        data = pkgutil.get_data("tutorial", "resources/webs.txt")
        data = data.decode()

        for url in data.split("\r\n"):
            if self.spider_type == 'first':
                url += 'first'
            if self.spider_type == 'second':
                url += 'second'
            yield scrapy.Request(url)

    def parse(self, response):
        pass

You can also always create a base class and then inherit from it, overriding only one variable (the one you append to the URL) and the name (for separate calls).
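That inheritance approach can be sketched like this (a minimal illustration: in a real project `BaseStockSpider` would subclass `scrapy.Spider`, and `load_urls` would run inside `start_requests`; the suffixes reuse `"string1"`/`"string2"` from the question):

```python
import pkgutil

class BaseStockSpider:
    """Shared logic; a real version would subclass scrapy.Spider."""
    url_suffix = ""  # each concrete spider overrides this

    @classmethod
    def load_urls(cls):
        # same resource file for every subclass, decoded and split
        # exactly as in the question
        data = pkgutil.get_data("tutorial", "resources/webs.txt").decode()
        return data.split("\r\n")

    @classmethod
    def build_start_urls(cls, urls):
        # only the appended suffix differs between spiders
        return [url + cls.url_suffix for url in urls]

class SpiderA(BaseStockSpider):
    name = "spiderA"
    url_suffix = "string1"

class SpiderB(BaseStockSpider):
    name = "spiderB"
    url_suffix = "string2"
```

With this layout, `scrapy crawl spiderA` and `scrapy crawl spiderB` keep working as separate commands while all shared code lives in the base class.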

Upvotes: 2
