Reputation: 603
I am trying to scrape a site using Scrapy. The URL of the webpage I need to crawl looks like this: http://www.example.com/bla-bla-bla/2
The next page I need to crawl is: http://www.example.com/bla-bla-bla/3
and the one after that is: http://www.example.com/bla-bla-bla/4
and so on.
This is the code I have written so far, based on the Scrapy tutorial:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from schooldata.items import SchooldataItem

class tv_spider(CrawlSpider):
    name = "tv"
    allowed_domain = ["http://www.example.com"]
    start_urls = [
        "http://www.example.com/bla-bla-bla/2"
    ]
    #rules = [Rule(SgmlLinkExtractor(allow=['/\d+']), 'parse_tv')]
    #rules = [Rule(SgmlLinkExtractor(allow=['/\d+']), callback='parse_tv')]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"bla-bla-bla/\d+"), follow=True, callback='parse_tv'),
    )

    def parse_tv(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
The problem I am facing is that the crawler fetches the start page but does not scrape any pages after that. Also, please note that the links to the subsequent pages are not contained in the start page.
What change do I need to make to my code to accomplish this?
Upvotes: 2
Views: 3044
Reputation: 166
Scrapy rules won't work in this case, because the links to the other pages are not present in the page content for the link extractor to follow. Generate the requests yourself instead, something like this:
from scrapy.http import Request

def start_requests(self):
    for i in range(1000):
        yield Request("http://www.example.com/bla-bla-bla/" + str(i), self.parse_tv)
where 1000 is the total number of pages.
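For reference, here is a minimal sketch of a complete spider built around this idea. It uses the plain Spider base class instead of CrawlSpider, since no rules are needed when the requests are generated directly; the page range starting at 2 and the file-naming scheme are assumptions based on the URLs in the question, so adjust them to the real site:

import scrapy

class TvSpider(scrapy.Spider):
    name = "tv"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # Pages are assumed to run from 2 up to the last page number;
        # replace 1000 with the real total for the site.
        for i in range(2, 1000):
            url = "http://www.example.com/bla-bla-bla/" + str(i)
            yield scrapy.Request(url, callback=self.parse_tv)

    def parse_tv(self, response):
        # Save each response body to a file named after the page number,
        # which is the last segment of the URL.
        page_number = response.url.split("/")[-1]
        with open(page_number, "wb") as f:
            f.write(response.body)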
Upvotes: 3