Reputation: 31
I am working on a Scrapy spider for a project. Most of the websites I am scraping have the same general format: a search page with pagination that links out to individual listing pages. I've written a spider that scrapes data for each listing from both the search page and the listing page. The issue I'm having is that while scraping, my spider will crawl the search pages and queue up listing pages to be scraped, but once it reaches the final search page the spider closes. Sometimes it will also end before reaching the final page. If I run it on just a single search page (no pagination), then all listings are returned. I'm still learning, so I'm sure I've missed something.
Here is an example spider with the same structure as the one I've written.
import scrapy


class exampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com/pages=1']
    custom_settings = {'FEED_URI': "example_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        # request each listing found on the search page
        for post in response.css('.job-listings'):
            url = post.css('.job-url::text').get()
            title = post.css('.job-title::text').get()
            yield scrapy.Request(url=url, callback=self.parse_listing,
                                 meta={'url': url, 'title': title})

        # pagination
        next_page = response.css('.pagination li:last-child a::attr(href)').get()
        if next_page is not None:
            next_page = 'https://example.com' + next_page
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_listing(self, response):
        yield {
            'url': response.meta['url'],
            'title': response.meta['title'],
            'company': response.css('.row:nth-child(1) a::text').get(),
            'specialty': response.css('.row:nth-child(2) a::text').get(),
            'city': response.css('.value span:nth-child(1)::text').get(),
            'state': response.css('.value span+ span::text').get(),
            'job type': response.css('.row:nth-child(4) .value::text').get(),
        }
Here is the output I typically get after running the spider. This particular website has around 6000 pages, but the spider only makes it through 153.
2021-01-11 17:43:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-11 17:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 115052,
'downloader/request_count': 203,
'downloader/request_method_count/GET': 203,
'downloader/response_bytes': 2272046,
'downloader/response_count': 203,
'downloader/response_status_count/200': 173,
'downloader/response_status_count/404': 2,
'downloader/response_status_count/429': 27,
'downloader/response_status_count/500': 1,
'elapsed_time_seconds': 223.01121,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 11, 23, 43, 0, 37435),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 153,
'log_count/DEBUG': 356,
'log_count/ERROR': 6,
'log_count/INFO': 14,
'request_depth_max': 13,
'response_received_count': 175,
'retry/count': 28,
'retry/reason_count/429 Unknown Status': 27,
'retry/reason_count/500 Internal Server Error': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 202,
'scheduler/dequeued/memory': 202,
'start_time': datetime.datetime(2021, 1, 11, 23, 39, 17, 26225)}
2021-01-11 17:43:00 [scrapy.core.engine] INFO: Spider closed (finished)
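One other thing I notice in the stats is the 27 responses with status 429 (rate limiting) that had to be retried; as far as I understand, if a next-page request ever runs out of retries it is dropped, which could also cut the pagination short. A minimal sketch of settings that slow the crawl down to reduce those 429s (the keys are standard Scrapy settings, the values are just guesses and not tuned for this site):
custom_settings = {
    'FEED_URI': "example_%(time)s.csv",
    'FEED_FORMAT': 'csv',
    'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
    'DOWNLOAD_DELAY': 1,           # wait between requests to the same site
    'RETRY_TIMES': 5,              # allow a few more attempts before a request is dropped
}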
Upvotes: 2
Views: 241
Reputation: 31
I found the fix to my issue, for anyone else experiencing something similar. There were two problems. First, the website I was scraping uses slightly different layouts for its job listings: advertised and regular postings have different class names, so once I got to around page 35 my for loop was hitting None and the scrape ended. Second, some of the listing pages no longer existed but were still posted on the search page, so when the scraper attempted to follow them it once again got None. The lesson here is to use try and except statements; my issue didn't have anything to do with pagination like I thought. Here is the updated code that now works for me.
import scrapy


class exampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com/pages=1']
    custom_settings = {'FEED_URI': "example_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        if response.css('.job-listings') == []:
            # no regular postings on this page, so use the advertised layout's class
            try:
                for post in response.css('.job-listings-old'):
                    url = post.css('.job-url::text').get()
                    title = post.css('.job-title::text').get()
                    yield scrapy.Request(url=url, callback=self.parse_listing,
                                         meta={'url': url, 'title': title})
            except Exception as e:
                print(e)
        else:
            try:
                for post in response.css('.job-listings'):
                    url = post.css('.job-url::text').get()
                    title = post.css('.job-title::text').get()
                    yield scrapy.Request(url=url, callback=self.parse_listing,
                                         meta={'url': url, 'title': title})
            except Exception as e:
                print(e)

        # pagination
        next_page = response.css('.pagination li:last-child a::attr(href)').get()
        if next_page is not None:
            next_page = 'https://example.com' + next_page
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_listing(self, response):
        yield {
            'url': response.meta['url'],
            'title': response.meta['title'],
            'company': response.css('.row:nth-child(1) a::text').get(),
            'specialty': response.css('.row:nth-child(2) a::text').get(),
            'city': response.css('.value span:nth-child(1)::text').get(),
            'state': response.css('.value span+ span::text').get(),
            'job type': response.css('.row:nth-child(4) .value::text').get(),
        }
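As a side note, the same None problems can also be handled without try/except by checking the value before building the request. This is just a sketch of that alternative parse loop, reusing the selectors above and assuming it is acceptable to simply skip a post whose URL comes back empty:
    def parse(self, response):
        # one grouped selector matches either layout, so neither class is missed
        for post in response.css('.job-listings, .job-listings-old'):
            url = post.css('.job-url::text').get()
            title = post.css('.job-title::text').get()
            if url:  # skip posts whose layout returned no URL
                yield scrapy.Request(url=url, callback=self.parse_listing,
                                     meta={'url': url, 'title': title})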
Upvotes: 1