Reputation: 3717
So I am passing in a start_url that is a page of news articles (e.g. cnn.com). I only want to extract the news articles themselves; I don't want to follow any links from the article pages. To do that, I'm using a CrawlSpider with the following rule:
rules = (
    Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                       deny=('someDenyUrls',)),
         callback='parse_article_page'),
)

def parse_article_page(self, response):
    # extracts the title, date, body, etc. of the article
I've enabled scrapy.spidermiddlewares.depth.DepthMiddleware and set DEPTH_LIMIT = 1.
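For reference, a minimal sketch of the spider as described above (the spider name, start URL, and CSS selector are illustrative placeholders, and DEPTH_LIMIT is set per-spider here, though setting it in settings.py works the same way):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsSpider(CrawlSpider):
    # placeholder name and start URL -- substitute your real values
    name = 'news'
    start_urls = ['https://www.cnn.com/']
    custom_settings = {'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                           deny=('someDenyUrls',)),
             callback='parse_article_page'),
    )

    def parse_article_page(self, response):
        # extract the title, date, body, etc. of the article
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }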
However, links from the individual article pages that happen to match regexToMatchArticleUrls are still being crawled, since they are links to other parts of the same website (and I cannot make the regex more restrictive).

But why are these links getting crawled at all when DEPTH_LIMIT = 1? Is it because the DEPTH_LIMIT resets for each link extracted by the LinkExtractor, i.e. the article page URLs? Is there a way to either make DEPTH_LIMIT work or extend the DepthMiddleware so it doesn't crawl links on the article pages? Thanks!
Upvotes: 4
Views: 1572
Reputation: 1981
For the DepthMiddleware to work correctly, the depth value in the request's meta needs to be passed along from one request to the next; otherwise, depth will be reset to 0 after each new request. Unfortunately, by default, the CrawlSpider doesn't carry this meta attribute over from one request to the next.

This can be solved with a spider middleware (middlewares.py):
from scrapy import Request


class StickyDepthSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # depth of the response that produced these requests/items
        key_found = response.meta.get('depth', None)
        for x in result:
            # propagate the parent's depth to every outgoing request
            if isinstance(x, Request) and key_found is not None:
                x.meta.setdefault('depth', key_found)
            # items and non-matching requests are passed through unchanged
            yield x
Also, don't forget to register this middleware in your settings.py:

SPIDER_MIDDLEWARES = {
    '{your_project_name}.middlewares.StickyDepthSpiderMiddleware': 100,
}
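If you'd rather keep the change scoped to this one spider, the same registration can also be done through the spider's custom_settings (a sketch; myproject stands in for your actual project/module name):

class NewsSpider(CrawlSpider):
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'SPIDER_MIDDLEWARES': {
            'myproject.middlewares.StickyDepthSpiderMiddleware': 100,
        },
    }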
Upvotes: 3