user210733

Crawling multiple starting URLs with different depths

I'm trying to get Scrapy 0.12 to use a different "maximum depth" setting for each URL in the spider's start_urls list.

If I understand the documentation correctly, there's no way to do this: the DEPTH_LIMIT setting is global for the whole project, and there's no notion of "requests originating from a given start URL".

Is there a way to work around this? Could I run multiple instances of the same spider, each initialized with a single start URL and its own depth limit?
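
For reference, DEPTH_LIMIT is a single project-wide value set in settings.py (the number below is just an example):

# settings.py -- applies to every request the crawl makes
DEPTH_LIMIT = 5   # 0, the default, means no limit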

Upvotes: 1

Views: 922

Answers (1)

warvariuc

Reputation: 59604

Sorry, it looks like I didn't understand your question correctly at first. Correcting my answer:

Responses have a depth key in their meta dict. You can check it and take the appropriate action.

from scrapy.spider import BaseSpider
from scrapy.http import Request


class MySpider(BaseSpider):

    def make_requests_from_url(self, url):
        # Tag every initial request with the start url it came from.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        # 'depth' is added to meta by the built-in DepthMiddleware;
        # responses to the start requests may not carry it yet.
        if response.meta['start_url'] == '???' and response.meta.get('depth', 0) > 10:
            # limit exceeded for this start url -- stop following links
            return
        else:
            # find links and yield requests for them, passing the start url along
            yield Request(other_url, meta={'start_url': response.meta['start_url']})

http://doc.scrapy.org/en/0.12/topics/spiders.html#scrapy.spider.BaseSpider.make_requests_from_url
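
To get different limits per start URL, one possible sketch builds on the same idea by keeping a mapping on the spider (the depth_limits dict, the example URLs, and the extract_links helper below are made up for illustration):

from scrapy.spider import BaseSpider
from scrapy.http import Request


class MySpider(BaseSpider):
    name = 'myspider'
    # Hypothetical start urls and per-start-url depth limits.
    start_urls = ['http://example.com/a', 'http://example.com/b']
    depth_limits = {'http://example.com/a': 3, 'http://example.com/b': 10}

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        start_url = response.meta['start_url']
        if response.meta.get('depth', 0) >= self.depth_limits[start_url]:
            return  # this branch of the crawl is done
        # extract_links is a hypothetical helper -- plug in your own link extraction
        for url in self.extract_links(response):
            yield Request(url, meta={'start_url': start_url})

With this approach you would leave DEPTH_LIMIT unset (or set it to the largest per-URL value) and rely on the check in parse instead.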

Upvotes: 1
