Crawling depth automation

Question

My site contains 3 levels.

Country
- City
  - Street

I want to scrape the data from all the street pages, and I have built a spider for this. How do I get from Country to the streets without adding a million URLs to the start_urls field?

Do I build one spider for country, one for city and one for street? Isn't the whole idea of crawling that the crawler follows all links down to a certain depth?

Adding DEPTH_LIMIT = 3 to the settings.py file did not change anything.

I start the crawl by: scrapy crawl spidername

EDIT

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import Spider
from scrapy.selector import Selector
from winkel.items import WinkelItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomain.nl/Zuid-Holland"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=('*Zuid-Holland*', )), callback='parse_winkel', follow=True),)

    def parse_winkel(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        items = []

        for site in sites:
            item = WinkelItem()
            item['adres'] = site.xpath('.//a/text()').extract(), site.xpath('text()').extract(), sel.xpath('//h1/text()').re(r'winkel\s*(.*)')
            items.append(item)
        return items

alecxe · Accepted Answer

You need to use CrawlSpider and define Rules with Link Extractors for countries, cities and streets.

For example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

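    # A Rule with a callback does not follow links by default, so the
    # country and city rules set follow=True to reach the street pages.
    rules = (
        Rule(SgmlLinkExtractor(allow=('country', )), callback='parse_country', follow=True),
        Rule(SgmlLinkExtractor(allow=('city', )), callback='parse_city', follow=True),
        Rule(SgmlLinkExtractor(allow=('street', )), callback='parse_street'),
    )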

    def parse_country(self, response):
        self.log('Hi, this is a country page! %s' % response.url)

    def parse_city(self, response):
        self.log('Hi, this is a city page! %s' % response.url)

    def parse_street(self, response):
        self.log('Hi, this is a street page! %s' % response.url)
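
To see why rule-driven following (rather than DEPTH_LIMIT alone) is what gets the crawler from Country down to Street, here is a minimal, Scrapy-free sketch of the traversal. The SITE link graph, the RULES tuples and the crawl helper are all hypothetical stand-ins, not Scrapy APIs; the logic only approximates how a CrawlSpider applies the first matching rule to each extracted link and how a depth limit drops requests that are too deep.

```python
import re
from collections import deque

# Hypothetical link graph: page URL -> links found on that page.
SITE = {
    "/": ["/country/nl"],
    "/country/nl": ["/city/rotterdam", "/city/delft"],
    "/city/rotterdam": ["/street/coolsingel"],
    "/city/delft": ["/street/markt"],
    "/street/coolsingel": [],
    "/street/markt": [],
}

# (pattern, label, follow): hypothetical stand-ins for Rule objects.
RULES = [
    (re.compile(r"/country/"), "country", True),
    (re.compile(r"/city/"), "city", True),
    (re.compile(r"/street/"), "street", False),
]


def crawl(start, rules, depth_limit=0):
    """Breadth-first crawl; returns (label, url) pairs in visit order.

    depth_limit=0 means "no limit"; the start page has depth 0.
    """
    seen = {start}
    queue = deque([(start, 0, True)])  # (url, depth, extract links?)
    visited = []
    while queue:
        url, depth, follow = queue.popleft()
        if not follow:
            continue  # the rule that matched this page said: do not follow
        if depth_limit and depth + 1 > depth_limit:
            continue  # links from this page would exceed the depth limit
        for link in SITE.get(url, []):
            if link in seen:
                continue
            for pattern, label, rule_follow in rules:
                if pattern.search(link):  # first matching rule wins
                    seen.add(link)
                    visited.append((label, link))
                    queue.append((link, depth + 1, rule_follow))
                    break
    return visited


if __name__ == "__main__":
    for label, url in crawl("/", RULES):
        print(label, url)
```

In this sketch the street pages are only reached because the country and city rules follow links; a depth limit can cut the crawl short but never extends it. In Scrapy itself, a Rule that has a callback defaults to follow=False, so the intermediate levels need follow=True set explicitly.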
