Jigno Alfred Venezuela

Reputation: 147

Q: Scrapy: Next pages not crawled but crawler seems to be following links

I'm trying to learn python and scrapy but I'm having problems with CrawlSpider. The code below works for me. It takes all the links in the start url that matches the xpath - //div[@class="info"]/h3/a/@href then passes those links to the function parse_dir_contents.

What I need now, is to get the crawler to move to the next page. I tried to use rules and linkextractor but I can't seem to make it work properly. I also tried using //a/@href as the xpath for the parse function but it wouldn't pass the links to the parse_dir_contents function. I think I'm missing something REALLY simple. Any ideas?

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=['restaurants?page=[1-2]']), callback="parse")
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            ---extra items here---
            yield item

Edit: Here's the updated code with three functions; it's able to scrape 150 items. I think the problem is with my rules, but everything I've tried gives the same output.

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=[r'restaurants\?page\=[1-2]']), callback='parse')
    ]

    def parse(self, response):
        for href in response.xpath('//a/@href'):
            url = response.urljoin(href.extract())
            if 'restaurants?page=' in url:
                yield scrapy.Request(url, callback=self.parse_links)

    def parse_links(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            item['phone'] = sel.xpath('//div/div/section/div/div[2]/p[3]/text()').extract()
            item['street'] = sel.xpath('//div/div/section/div/div[2]/p[1]/text()').re(r'(.+)\,')
            item['city'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(.+)\,')
            item['state'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'\,\s(.+)\s\d')
            item['zip'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(\d+)')
            item['category'] = sel.xpath('//dd[@class="categories"]/span/a/text()').extract()
            yield item

Upvotes: 0

Views: 357

Answers (2)

Tony Montana

Reputation: 1019

I know this is a very late answer to this problem, but I managed to solve it and am posting my answer because it might be helpful for someone who, like me, was confused about how to use Scrapy's Rule and LinkExtractor in the first place.

This is my working code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ['http://www.yellowpages.com/new-york-ny/restaurants']
    rules = (
        Rule(LinkExtractor(allow=[r'restaurants\?page=\d+']), follow=True), # Scrapes all the pagination links 
        Rule(LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']"), callback='parse_item'), # Scrapes all the restaurant detail links and use `parse_item` as a callback method
    )

    def parse_item(self, response):
        yield {
            'url' : response.url
        }

So, I managed to understand how Rule and LinkExtractor work in this scenario.

The first Rule entry is for scraping all the pagination links; the allow parameter of LinkExtractor is basically a regex, and only links that match it are passed through. In this scenario, that means only links containing a pattern like restaurants\?page=\d+, where \d+ means one or more digits. Since this Rule has no callback, the extracted pages are simply followed. (Here I could have used the restrict_xpaths parameter instead of allow to pick only links from a specific region of the HTML, but I used allow to understand how it works with a regex.)
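
As an illustration of that filter, here is a small standalone sketch (the sample URLs below are made up) that checks which URLs the allow pattern would accept, using plain Python re; LinkExtractor applies the same kind of regex search to each extracted URL:

import re

# The same pattern used in the Rule's allow parameter above.
pattern = re.compile(r'restaurants\?page=\d+')

sample_urls = [
    "http://www.yellowpages.com/new-york-ny/restaurants?page=2",   # pagination link, should match
    "http://www.yellowpages.com/new-york-ny/restaurants",          # no page parameter, no match
    "http://www.yellowpages.com/new-york-ny/mip/some-restaurant",  # detail page, no match
]

for url in sample_urls:
    # A regex search against the URL, as LinkExtractor's allow does internally.
    print(url, "->", bool(pattern.search(url)))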

The second Rule is for fetching all the restaurant detail links and parsing them with the parse_item method. In this Rule we use the restrict_xpaths parameter, which defines the regions inside the response that links should be extracted from. Here we only take links that sit inside the div with class scrollable-pane and that themselves carry class business-name, because if you inspect the HTML you'll find more than one link to the same restaurant, with different query parameters, inside the same div. Finally, we pass our callback method parse_item.
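
To see restrict_xpaths in isolation, here's a small sketch against a made-up HTML fragment (the markup is a simplified assumption for illustration, not the real yellowpages.com page); only the business-name link inside the scrollable-pane div is extracted:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Simplified, made-up markup mimicking the structure described above.
body = b"""
<html><body>
  <div class="scrollable-pane">
    <a class="business-name" href="/new-york-ny/mip/some-restaurant-123">Some Restaurant</a>
    <a class="menu-link" href="/new-york-ny/mip/some-restaurant-123?tab=menu">Menu</a>
  </div>
  <a href="/new-york-ny/restaurants?page=2">Next</a>
</body></html>
"""

response = HtmlResponse(url="http://www.yellowpages.com/new-york-ny/restaurants",
                        body=body, encoding="utf-8")

extractor = LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']")
for link in extractor.extract_links(response):
    print(link.url)  # only the business-name link is returned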

Now, when I run this spider, it fetches the details of all the restaurants (Restaurants in New York, NY), which in this scenario is 3030 in total.

Upvotes: 0

Steve

Reputation: 976

CrawlSpider uses the parse method for its own purposes. Rename your parse() to something else, change the callback in rules[] to match, and try again.
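
For example, a minimal sketch of that change applied to the question's first spider (the name parse_listing is just a placeholder, and the page regex has been widened to \d+ so every pagination page is followed; adjust both as needed):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        # The callback is no longer called 'parse', so it doesn't clash
        # with the parse() method that CrawlSpider itself relies on.
        Rule(LinkExtractor(allow=[r'restaurants\?page=\d+']),
             callback='parse_listing', follow=True),
    ]

    def parse_listing(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Item extraction as in the question goes here; a bare URL is
        # yielded just to keep the sketch self-contained.
        yield {'url': response.url}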

Upvotes: 1
