m3nthal

Reputation: 433

The spider doesn't go to the next page

Spider code:

import scrapy
from crawler.items import Item

class DmozSpider(scrapy.Spider):
    name = 'blabla'
    allowed_domains = ['blabla']

    def start_requests(self):
        yield scrapy.Request('http://blabla.org/forum/viewforum.php?f=123', self.parse)

    def parse(self, response):
        item = Item()
        # Note the closing "]" after the attribute predicate; the original
        # expression '//a[@class="title"/text()' was an invalid XPath.
        item['Title'] = response.xpath('//a[@class="title"]/text()').extract()
        yield item

        next_page = response.xpath('//a[text()="Next"]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, callback=self.parse)

Problem: the spider stops after the first page even though next_page and url exist and are correct.

Here is the last debug message before stop:

[scrapy] DEBUG: Crawled (200) <GET http://blabla.org/forum/viewforum.php?f=123&start=50> (referer: http://blabla.org/forum/viewforum.php?f=123)
[scrapy] INFO: Closing spider (finished)

Upvotes: 0

Views: 99

Answers (2)

m3nthal

Reputation: 433

The problem was that the response for the next page was a robots/anti-bot page rather than the forum listing, so it did not contain any of the expected links.
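A blocked page like this is easy to miss because the request still returns HTTP 200. One way to catch it is a small heuristic check on the response body before parsing; the helper below is a hypothetical sketch (not part of the original spider), assuming the real page always contains a "Next" pagination link:

```python
def looks_like_robot_page(html: str, expected_text: str = "Next") -> bool:
    """Heuristic: an anti-bot/robots interstitial usually lacks the forum's
    pagination link and contains typical blocking phrases instead."""
    lowered = html.lower()
    blocked_markers = ("captcha", "are you a robot", "access denied")
    has_expected = expected_text.lower() in lowered
    return (not has_expected) and any(m in lowered for m in blocked_markers)
```

In the spider's parse method you could then log a warning (e.g. `if looks_like_robot_page(response.text): self.logger.warning(...)`) instead of silently yielding empty items.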

Upvotes: 0

Niranjan Sagar

Reputation: 829

You need to check the following:

  1. Check whether the URLs you are trying to crawl are disallowed by robots.txt, which you can find at http://blabla.org/robots.txt. By default Scrapy obeys robots.txt, and it is recommended that you abide by it.
  2. You can also increase Scrapy's download delay to 2 seconds or more and try again.
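Both checks map onto standard Scrapy settings. A minimal sketch for a project's settings.py (values are illustrative; adjust to your site and its terms):

```python
# settings.py (or custom_settings on the spider)

# ROBOTSTXT_OBEY controls whether Scrapy fetches and respects robots.txt
# before crawling. Disabling it is only appropriate if the site permits
# your crawl.
ROBOTSTXT_OBEY = True

# Throttle requests so the server is less likely to block the crawler.
DOWNLOAD_DELAY = 2  # seconds between requests to the same domain
```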

Upvotes: 1

Related Questions