Reputation: 1546
My site contains 3 levels: country, city and street.
I want to scrape the data from all the street pages. For this I have built a spider. Now how do I get from country to streets without adding a million URLs to the start_urls field?
Do I build one spider for country, one for city and one for street? Isn't the whole idea of crawling that the crawler follows all links down to a certain depth?
Adding DEPTH_LIMIT = 3 to the settings.py file did not change anything.
I start the crawl by: scrapy crawl spidername
EDIT
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import Spider
from scrapy.selector import Selector
from winkel.items import WinkelItem


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomain.nl/Zuid-Holland"
    ]
    rules = (Rule(SgmlLinkExtractor(allow=('*Zuid-Holland*', )), callback='parse_winkel', follow=True),)

    def parse_winkel(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        items = []
        for site in sites:
            item = WinkelItem()
            item['adres'] = site.xpath('.//a/text()').extract(), site.xpath('text()').extract(), sel.xpath('//h1/text()').re(r'winkel\s*(.*)')
            items.append(item)
        return items
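A side note on the rule in this spider: as far as I can tell, the `allow` argument of Scrapy's link extractors takes regular expressions that are searched for anywhere in the URL, not shell-style globs. If that is right, `'*Zuid-Holland*'` is not a valid pattern, while plain `'Zuid-Holland'` already matches any URL containing it. A minimal check using only Python's `re` module (the `compiles` helper and the sample URL ending in `/Rotterdam` are made up for illustration):

```python
import re


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hypothetical URL, shaped like the start URL in the question.
url = "http://www.mydomain.nl/Zuid-Holland/Rotterdam"

glob_style = "*Zuid-Holland*"   # glob syntax: the leading * has nothing to repeat
regex_style = "Zuid-Holland"    # plain substring pattern

print(compiles(glob_style))
print(compiles(regex_style))
print(bool(re.search(regex_style, url)))
```

If the pattern does turn out to be the problem, replacing it with `'Zuid-Holland'` in the rule would be the first thing to try.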
Upvotes: 1
Views: 457
Reputation: 473773
You need to use CrawlSpider and define Rules with Link Extractors for countries, cities and streets.
For example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
# Note: newer Scrapy releases expose these as scrapy.spiders.CrawlSpider/Rule
# and scrapy.linkextractors.LinkExtractor.


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # A rule with a callback does not follow links by default, so the
    # country and city rules set follow=True to reach the street pages.
    rules = (
        Rule(SgmlLinkExtractor(allow=('country', )), callback='parse_country', follow=True),
        Rule(SgmlLinkExtractor(allow=('city', )), callback='parse_city', follow=True),
        Rule(SgmlLinkExtractor(allow=('street', )), callback='parse_street'),
    )

    def parse_country(self, response):
        self.log('Hi, this is a country page! %s' % response.url)

    def parse_city(self, response):
        self.log('Hi, this is a city page! %s' % response.url)

    def parse_street(self, response):
        self.log('Hi, this is a street page! %s' % response.url)
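Whether the crawl actually descends from country to city to street depends on the `follow` flag of each rule: when a rule has a callback, `follow` defaults to `False`, so links on the matched pages are not extracted. The sketch below is a toy model in plain Python, not Scrapy itself. The `SITE` link graph, the URL paths and the `crawl` helper are all made up for illustration.

```python
import re
from collections import deque

# Hypothetical three-level site: page URL -> links found on that page.
SITE = {
    "/": ["/country/nl"],
    "/country/nl": ["/city/rotterdam", "/city/delft"],
    "/city/rotterdam": ["/street/coolsingel"],
    "/city/delft": ["/street/markt"],
    "/street/coolsingel": [],
    "/street/markt": [],
}


def crawl(start, rules):
    """Toy breadth-first crawl.

    `rules` is a list of (regex, follow) pairs. The start page is always
    scanned for links; a linked page is visited when its URL matches a rule,
    and its own links are scanned only if that rule has follow=True.
    """
    visited = []
    seen = {start}
    queue = deque([start])  # pages whose links will be extracted
    while queue:
        page = queue.popleft()
        for link in SITE[page]:
            if link in seen:
                continue
            for pattern, follow in rules:
                if re.search(pattern, link):  # first matching rule wins
                    seen.add(link)
                    visited.append(link)
                    if follow:
                        queue.append(link)
                    break
    return visited


no_follow = [("country", False), ("city", False), ("street", False)]
with_follow = [("country", True), ("city", True), ("street", False)]

print(crawl("/", no_follow))
print(crawl("/", with_follow))
```

In this model the first rule set stops at the country page, while the second reaches both street pages. The street rule can leave `follow` off because nothing below that level needs to be visited.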
Upvotes: 2