Reputation: 149
I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for the site. How can I crawl all of the website's pages with Scrapy?
I just need to download all the pages of the site without extracting any items. Do I only need to set the Rule of the Spider to follow all links? And will Scrapy avoid downloading duplicate URLs that way?
Upvotes: 6
Views: 6428
Reputation: 149
I just found the answer myself. With the CrawlSpider class, we just need to pass allow=() to the SgmlLinkExtractor. As the documentation says:
allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
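For reference, a minimal sketch of what such a spider could look like. SgmlLinkExtractor has since been deprecated in favour of LinkExtractor, which accepts the same allow argument, so the sketch uses that; the spider name, domain, and the file-saving callback are placeholders, not anything from the question.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = 'site'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['http://example.com/']

    rules = (
        # allow=() matches every URL, so every link on a page is extracted;
        # follow=True keeps the crawl going from the pages those links lead to.
        Rule(LinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # No item extraction: just write the raw page body to disk.
        filename = response.url.split('/')[-1] or 'index.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

As for duplicates: Scrapy's scheduler filters repeated requests through its dupefilter by default, so the same URL is not downloaded twice within a crawl.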
Upvotes: 5
Reputation: 4364
In your Spider, define allowed_domains as a list of domains you want to crawl.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.
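For illustration, a minimal sketch of that approach, reusing the quotes.toscrape.com example from below; the 'a::attr(href)' selector is an assumption about where the links sit on the page.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow every link found on the page. allowed_domains keeps the
        # crawl on-site, and the default dupefilter skips URLs already seen.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)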
Alternatively, you can filter the domains with a LinkExtractor (like David Thompson mentioned).
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
Upvotes: 5