David Thompson

Reputation: 149

How to crawl a site given only the domain URL with Scrapy

I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for it. How can I crawl all of the website's pages with Scrapy?

I just need to download all of the site's pages without extracting any items. Is it enough to set a Rule in the spider that follows all links? And will Scrapy avoid visiting duplicate URLs if I do it that way?

Upvotes: 6

Views: 6428

Answers (2)

David Thompson

Reputation: 149

I just found the answer myself. With the CrawlSpider class, we just need to pass allow=() to SgmlLinkExtractor. As the documentation says:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
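For illustration, here is a minimal sketch of such a spider. It uses the current LinkExtractor (which replaced SgmlLinkExtractor in later Scrapy versions); the spider name, domain, and callback name are placeholders, not something from the original question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = 'site'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # allow=() (empty) matches every link; follow=True keeps crawling
    # from each extracted page. Scrapy's built-in duplicate filter
    # skips URLs it has already seen, so pages are not fetched twice.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # The question only asks for downloading pages, so no items
        # are yielded here; response.body holds the raw page content.
        pass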

Upvotes: 5

jpyams

Reputation: 4364

In your Spider, define allowed_domains as a list of domains you want to crawl.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.
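For example, a minimal sketch of that approach (the link selector below is an assumption for illustration; requests to other domains are dropped because of allowed_domains):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow every link on the page; response.follow() resolves
        # relative URLs against the current page, and the offsite
        # filter discards anything outside allowed_domains.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)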

Alternatively, you can filter the domains with a LinkExtractor (like David Thompson mentioned).

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):

    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
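
As a usage note, a spider like this can be run with scrapy crawl quotes from inside a Scrapy project, or as a standalone file with scrapy runspider.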

Upvotes: 5
