jajajasalu2

Reputation: 11

How to change the depth limit while crawling with Scrapy?

I want to either disable the depth checking and incrementing for one method in my spider, or change the depth limit while crawling. Here's some of my code:

    def start_requests(self):
        if isinstance(self.vuln, context.GenericVulnerability):
            yield Request(
                self.vuln.base_url,
                callback=self.determine_aliases,
                meta=self._normal_meta,
            )
        else:
            for url in self.vuln.entrypoint_urls:
                yield Request(
                    url, callback=self.parse, meta=self._patch_find_meta
                )


    @inline_requests
    def determine_aliases(self, response):
        vulns = [self.vuln]
        processed_vulns = set()
        while vulns:
            vuln = vulns.pop()
            if vuln.vuln_id != self.vuln.vuln_id:
                response = yield Request(vuln.base_url)
            processed_vulns.add(vuln.vuln_id)
            aliases = context.create_vulns(*list(self.parse(response)))
            for alias in aliases:
                if alias.vuln_id in processed_vulns:
                    continue
                if isinstance(alias, context.GenericVulnerability):
                    vulns.append(alias)
                else:
                    logger.info("Alias discovered: %s", alias.vuln_id)
                    self.cves.add(alias)
        yield from self._generate_requests_for_vulns()


    def _generate_requests_for_vulns(self):
        for vuln in self.cves:
            for url in vuln.entrypoint_urls:
                yield Request(
                    url, callback=self.parse, meta=self._patch_find_meta
                )

My program takes the depth limit the user wants as an input. Under some conditions, my default parse method recursively crawls links.

determine_aliases is kind of a preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.

As you can see, in determine_aliases I scrape the data I need from the response and store it in a set attribute 'cves' on my spider class. Once that's done, I yield Requests based on that data from _generate_requests_for_vulns.

The problem here is that both yielding requests from determine_aliases and calling determine_aliases as a callback increment the depth. So by the time I yield Requests from _generate_requests_for_vulns for further crawling, part of the depth budget is already used up and my depth limit is reached sooner than expected.
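
For reference, Scrapy's DepthMiddleware gives every request yielded from a callback a depth one greater than the response it came from, and drops it once DEPTH_LIMIT is exceeded. Roughly like this (a simplified paraphrase of scrapy.spidermiddlewares.depth.DepthMiddleware, not the actual source):

    from scrapy import Request

    class DepthMiddlewareSketch:
        def __init__(self, maxdepth):
            self.maxdepth = maxdepth

        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request):
                    # every callback hop adds one to the depth
                    depth = response.meta["depth"] + 1
                    request.meta["depth"] = depth
                    if self.maxdepth and depth > self.maxdepth:
                        return False  # over the limit: request is dropped
                return True

            response.meta.setdefault("depth", 0)
            return (r for r in result or () if _filter(r))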

Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.

Upvotes: 0

Views: 642

Answers (1)

jajajasalu2

Reputation: 11

I ended up solving this by creating a spider middleware that resets the depth to 0. I pass a "reset_depth" meta key set to True in the request, and when the middleware sees that flag it sets the request's depth back to 0.

    from scrapy import Request


    class DepthResetMiddleware(object):

        def process_spider_output(self, response, result, spider):
            for r in result:
                # Items and other non-request output pass through untouched.
                if not isinstance(r, Request):
                    yield r
                    continue
                # If the spider asked for a reset, zero out the depth that
                # DepthMiddleware has already assigned to this request.
                if "depth" in r.meta and r.meta.get("reset_depth"):
                    r.meta["depth"] = 0
                yield r
The Request should be yielded from the spider like this:

    yield Request(url, meta={"reset_depth": True})
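
In the spider from the question, that flag could be merged into the existing meta roughly like this (a sketch, assuming _patch_find_meta is a plain dict):

    def _generate_requests_for_vulns(self):
        for vuln in self.cves:
            for url in vuln.entrypoint_urls:
                # keep the existing meta and add the flag the middleware looks for
                meta = dict(self._patch_find_meta, reset_depth=True)
                yield Request(url, callback=self.parse, meta=meta)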

Then add the middleware to your settings. The order matters: Scrapy calls process_spider_output on spider middlewares in decreasing order of their setting value, so giving DepthResetMiddleware a lower number than DepthMiddleware's default of 900 ensures the reset runs after DepthMiddleware has assigned the request's depth. I used 850 in my CrawlerProcess:

"SPIDER_MIDDLEWARES": {
    "patchfinder.middlewares.DepthResetMiddleware": 850
}
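
For completeness, wired into a CrawlerProcess the settings could look something like this (the DEPTH_LIMIT value and the patchfinder module path are placeholders from my project, and MySpider stands in for your spider class):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        "DEPTH_LIMIT": 2,  # whatever limit the user supplied
        "SPIDER_MIDDLEWARES": {
            "patchfinder.middlewares.DepthResetMiddleware": 850,
        },
    })
    # process.crawl(MySpider)
    # process.start()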

I don't know if this is the best solution, but it works. Another option would be to extend DepthMiddleware and add this functionality there.
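
Here is an untested sketch of that alternative, subclassing the stock middleware and letting it assign the depth before the reset (the class name and the wiring are mine, not part of Scrapy):

    from scrapy import Request
    from scrapy.spidermiddlewares.depth import DepthMiddleware

    class ResettingDepthMiddleware(DepthMiddleware):
        """DepthMiddleware variant that honours a 'reset_depth' meta flag."""

        def process_spider_output(self, response, result, spider):
            # Let the stock middleware assign and check the depth first,
            # then zero it out for requests that asked for a reset.
            for r in super().process_spider_output(response, result, spider):
                if isinstance(r, Request) and r.meta.get("reset_depth"):
                    r.meta["depth"] = 0
                yield r

If you go that route, disable the stock entry and register the subclass at the same order, i.e. set "scrapy.spidermiddlewares.depth.DepthMiddleware" to None and "patchfinder.middlewares.ResettingDepthMiddleware" to 900 in SPIDER_MIDDLEWARES.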

Upvotes: 1
