jajajasalu2

Reputation: 11

How to change the depth limit while crawling with Scrapy?

I want to either disable the depth checking and incrementing for one method in my spider, or change the depth limit while crawling. Here's some of my code:

    def start_requests(self):
        if isinstance(self.vuln, context.GenericVulnerability):
            yield Request(
                self.vuln.base_url,
                callback=self.determine_aliases,
                meta=self._normal_meta,
            )
        else:
            for url in self.vuln.entrypoint_urls:
                yield Request(
                    url, callback=self.parse, meta=self._patch_find_meta
                )


    @inline_requests
    def determine_aliases(self, response):
        vulns = [self.vuln]
        processed_vulns = set()
        while vulns:
            vuln = vulns.pop()
            if vuln.vuln_id != self.vuln.vuln_id:
                response = yield Request(vuln.base_url)
            processed_vulns.add(vuln.vuln_id)
            aliases = context.create_vulns(*list(self.parse(response)))
            for alias in aliases:
                if alias.vuln_id in processed_vulns:
                    continue
                if isinstance(alias, context.GenericVulnerability):
                    vulns.append(alias)
                else:
                    logger.info("Alias discovered: %s", alias.vuln_id)
                    self.cves.add(alias)
        yield from self._generate_requests_for_vulns()


    def _generate_requests_for_vulns(self):
        for vuln in self.cves:
            for url in vuln.entrypoint_urls:
                yield Request(
                    url, callback=self.parse, meta=self._patch_find_meta
                )

My program takes the depth limit the user wants as an input. Under some conditions, my default parse method recursively crawls links.

determine_aliases is kind of a preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.

As you can see, in determine_aliases I scrape the data I need from the response and store it in a set attribute 'cves' on my spider class. Once that's done, I yield Requests based on that data from _generate_requests_for_vulns.

The problem here is that both yielding requests from determine_aliases and calling determine_aliases as a callback increment the depth. So by the time I yield Requests from _generate_requests_for_vulns for further crawling, part of the depth budget is already used up and my depth limit is reached sooner than expected.
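
For reference, Scrapy's DepthMiddleware gives every request yielded from a callback a depth one greater than the response it came from, and drops it once DEPTH_LIMIT is exceeded. Roughly like this (a simplified paraphrase of scrapy.spidermiddlewares.depth.DepthMiddleware, not the actual source):

    from scrapy import Request

    class DepthMiddlewareSketch:
        def __init__(self, maxdepth):
            self.maxdepth = maxdepth

        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request):
                    # every callback hop adds one to the depth
                    depth = response.meta["depth"] + 1
                    request.meta["depth"] = depth
                    if self.maxdepth and depth > self.maxdepth:
                        return False  # over the limit: request is dropped
                return True

            response.meta.setdefault("depth", 0)
            return (r for r in result or () if _filter(r))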

Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.

Upvotes: 0

Views: 642

Answers (1)

jajajasalu2

Reputation: 11

I ended up solving this by creating a spider middleware that resets the depth to 0. I pass a "reset_depth" meta key set to True in the request, and when the middleware sees that flag it sets the request's depth back to 0.

    from scrapy import Request


    class DepthResetMiddleware(object):

        def process_spider_output(self, response, result, spider):
            for r in result:
                # Items and other non-request output pass through untouched.
                if not isinstance(r, Request):
                    yield r
                    continue
                # If the spider asked for a reset, zero out the depth that
                # DepthMiddleware has already assigned to this request.
                if "depth" in r.meta and r.meta.get("reset_depth"):
                    r.meta["depth"] = 0
                yield r
The Request should be yielded from the spider like this:

    yield Request(url, meta={"reset_depth": True})
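
In the spider from the question, that flag could be merged into the existing meta roughly like this (a sketch, assuming _patch_find_meta is a plain dict):

    def _generate_requests_for_vulns(self):
        for vuln in self.cves:
            for url in vuln.entrypoint_urls:
                # keep the existing meta and add the flag the middleware looks for
                meta = dict(self._patch_find_meta, reset_depth=True)
                yield Request(url, callback=self.parse, meta=meta)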

Then add the middleware to your settings. The order matters: Scrapy calls process_spider_output on spider middlewares in decreasing order of their setting value, so giving DepthResetMiddleware a lower number than DepthMiddleware's default of 900 ensures the reset runs after DepthMiddleware has assigned the request's depth. I used 850 in my CrawlerProcess:

"SPIDER_MIDDLEWARES": {
    "patchfinder.middlewares.DepthResetMiddleware": 850
}
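
For completeness, wired into a CrawlerProcess the settings could look something like this (the DEPTH_LIMIT value and the patchfinder module path are placeholders from my project, and MySpider stands in for your spider class):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        "DEPTH_LIMIT": 2,  # whatever limit the user supplied
        "SPIDER_MIDDLEWARES": {
            "patchfinder.middlewares.DepthResetMiddleware": 850,
        },
    })
    # process.crawl(MySpider)
    # process.start()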

I don't know if this is the best solution, but it works. Another option would be to extend DepthMiddleware and add this functionality there.
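
Here is an untested sketch of that alternative, subclassing the stock middleware and letting it assign the depth before the reset (the class name and the wiring are mine, not part of Scrapy):

    from scrapy import Request
    from scrapy.spidermiddlewares.depth import DepthMiddleware

    class ResettingDepthMiddleware(DepthMiddleware):
        """DepthMiddleware variant that honours a 'reset_depth' meta flag."""

        def process_spider_output(self, response, result, spider):
            # Let the stock middleware assign and check the depth first,
            # then zero it out for requests that asked for a reset.
            for r in super().process_spider_output(response, result, spider):
                if isinstance(r, Request) and r.meta.get("reset_depth"):
                    r.meta["depth"] = 0
                yield r

If you go that route, disable the stock entry and register the subclass at the same order, i.e. set "scrapy.spidermiddlewares.depth.DepthMiddleware" to None and "patchfinder.middlewares.ResettingDepthMiddleware" to 900 in SPIDER_MIDDLEWARES.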

Upvotes: 1
