Reputation: 432
I am trying to crawl some attributes from all (#123) detail pages linked from this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not following the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck!
Below is the code:
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]

    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)

    def parse_items(self, response):
        print "response", response
        hxs = HtmlXPathSelector(response)
        title = hxs.select("//*[@id='content']/div/h4").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
Can someone please advise what I am doing wrong here?
Upvotes: 2
Views: 2920
Reputation: 473763
The key problem with your code is that you have not set the rules for the CrawlSpider.

Other improvements I would suggest:

- there is no need to instantiate HtmlXPathSelector; you can use response directly
- select() is deprecated now, use xpath()
- get the text() of the title element in order to retrieve, for instance, Chandoka instead of <h4>Chandoka</h4>
The complete code with the applied improvements:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
             callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
Upvotes: 3
Reputation: 1712
Seems like you have some syntax errors. Try this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response
Upvotes: 0
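For completeness, either spider can also be launched from a plain Python script instead of the scrapy crawl command. This is a minimal sketch using CrawlerProcess (available since Scrapy 1.0), assuming the Stinkybklyn class above is defined in the same file; the USER_AGENT value is just an illustrative setting:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess({
        # any reasonable user agent; this value is only an example
        "USER_AGENT": "Mozilla/5.0 (compatible; example-bot)",
    })
    process.crawl(Stinkybklyn)  # the CrawlSpider subclass defined above
    process.start()             # blocks until crawling is finished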