David Thompson

Reputation: 149

How to crawl a site given only the domain URL with Scrapy

I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for it. How can I crawl all of the website's pages with Scrapy?

I just need to download all of the site's pages without extracting any items. Is it enough to set a Rule in the spider that follows all links? And will Scrapy avoid visiting duplicate URLs if I do it that way?

Upvotes: 6

Views: 6428

Answers (2)

David Thompson

Reputation: 149

I just found the answer myself. With the CrawlSpider class, we just need to pass allow=() to SgmlLinkExtractor. As the documentation says:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
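For illustration, here is a minimal sketch of such a spider. It uses the current LinkExtractor (which replaced SgmlLinkExtractor in later Scrapy versions); the spider name, domain, and callback name are placeholders, not something from the original question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = 'site'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # allow=() (empty) matches every link; follow=True keeps crawling
    # from each extracted page. Scrapy's built-in duplicate filter
    # skips URLs it has already seen, so pages are not fetched twice.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # The question only asks for downloading pages, so no items
        # are yielded here; response.body holds the raw page content.
        pass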

Upvotes: 5

jpyams

Reputation: 4364

In your Spider, define allowed_domains as a list of domains you want to crawl.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.
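For example, a minimal sketch of that approach (the link selector below is an assumption for illustration; requests to other domains are dropped because of allowed_domains):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow every link on the page; response.follow() resolves
        # relative URLs against the current page, and the offsite
        # filter discards anything outside allowed_domains.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)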

Alternatively, you can filter the domains with a LinkExtractor (like David Thompson mentioned).

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):

    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
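
As a usage note, a spider like this can be run with scrapy crawl quotes from inside a Scrapy project, or as a standalone file with scrapy runspider.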

Upvotes: 5
