Reputation: 31
I am working on a Scrapy spider for a project. Most of the websites I am scraping have the same general format: a search page with pagination that links out to individual listing pages. I've written a spider that scrapes data for each listing from both the search page and the listing page. The issue I'm having is that while scraping, my spider will crawl the search pages and queue up listing pages to be scraped, but once it reaches the final search page the spider closes. Sometimes it will also end before reaching the final page. If I run it on just a single search page (no pagination), then all listings are returned. I'm still learning, so I'm sure I've missed something.
Here is an example spider with the same structure as the one I've written.
import scrapy


class exampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com/pages=1']
    custom_settings = {'FEED_URI': "example_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        # request each listing found on the search page
        for post in response.css('.job-listings'):
            url = post.css('.job-url::text').get()
            title = post.css('.job-title::text').get()
            yield scrapy.Request(url=url, callback=self.parse_listing,
                                 meta={'url': url, 'title': title})

        # pagination
        next_page = response.css('.pagination li:last-child a::attr(href)').get()
        if next_page is not None:
            next_page = 'https://example.com' + next_page
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_listing(self, response):
        yield {
            'url': response.meta['url'],
            'title': response.meta['title'],
            'company': response.css('.row:nth-child(1) a::text').get(),
            'specialty': response.css('.row:nth-child(2) a::text').get(),
            'city': response.css('.value span:nth-child(1)::text').get(),
            'state': response.css('.value span+ span::text').get(),
            'job type': response.css('.row:nth-child(4) .value::text').get(),
        }
Here is the output I typically get after running the spider. This particular website has around 6000 pages, but the spider only makes it through 153.
2021-01-11 17:43:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-11 17:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 115052,
'downloader/request_count': 203,
'downloader/request_method_count/GET': 203,
'downloader/response_bytes': 2272046,
'downloader/response_count': 203,
'downloader/response_status_count/200': 173,
'downloader/response_status_count/404': 2,
'downloader/response_status_count/429': 27,
'downloader/response_status_count/500': 1,
'elapsed_time_seconds': 223.01121,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 11, 23, 43, 0, 37435),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 153,
'log_count/DEBUG': 356,
'log_count/ERROR': 6,
'log_count/INFO': 14,
'request_depth_max': 13,
'response_received_count': 175,
'retry/count': 28,
'retry/reason_count/429 Unknown Status': 27,
'retry/reason_count/500 Internal Server Error': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 202,
'scheduler/dequeued/memory': 202,
'start_time': datetime.datetime(2021, 1, 11, 23, 39, 17, 26225)}
2021-01-11 17:43:00 [scrapy.core.engine] INFO: Spider closed (finished)
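One other thing I notice in the stats is the 27 responses with status 429 (rate limiting) that had to be retried; as far as I understand, if a next-page request ever runs out of retries it is dropped, which could also cut the pagination short. A minimal sketch of settings that slow the crawl down to reduce those 429s (the keys are standard Scrapy settings, the values are just guesses and not tuned for this site):
custom_settings = {
    'FEED_URI': "example_%(time)s.csv",
    'FEED_FORMAT': 'csv',
    'AUTOTHROTTLE_ENABLED': True,  # back off automatically when responses slow down
    'DOWNLOAD_DELAY': 1,           # wait between requests to the same site
    'RETRY_TIMES': 5,              # allow a few more attempts before a request is dropped
}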
Upvotes: 2
Views: 241
Reputation: 31
I found the fix to my issue, for anyone else experiencing something similar. There were two problems. First, the website I was scraping uses slightly different layouts for its job listings: advertised and regular postings have different class names, so once I got to around page 35 my for loop was hitting None and the scrape ended. Second, some of the listing pages no longer existed but were still posted on the search page, so when the scraper attempted to follow them it once again got None. The lesson here is to use try and except statements; my issue didn't have anything to do with pagination like I thought. Here is the updated code that now works for me.
import scrapy


class exampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com/pages=1']
    custom_settings = {'FEED_URI': "example_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        if response.css('.job-listings') == []:
            # no regular postings on this page, so use the advertised layout's class
            try:
                for post in response.css('.job-listings-old'):
                    url = post.css('.job-url::text').get()
                    title = post.css('.job-title::text').get()
                    yield scrapy.Request(url=url, callback=self.parse_listing,
                                         meta={'url': url, 'title': title})
            except Exception as e:
                print(e)
        else:
            try:
                for post in response.css('.job-listings'):
                    url = post.css('.job-url::text').get()
                    title = post.css('.job-title::text').get()
                    yield scrapy.Request(url=url, callback=self.parse_listing,
                                         meta={'url': url, 'title': title})
            except Exception as e:
                print(e)

        # pagination
        next_page = response.css('.pagination li:last-child a::attr(href)').get()
        if next_page is not None:
            next_page = 'https://example.com' + next_page
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_listing(self, response):
        yield {
            'url': response.meta['url'],
            'title': response.meta['title'],
            'company': response.css('.row:nth-child(1) a::text').get(),
            'specialty': response.css('.row:nth-child(2) a::text').get(),
            'city': response.css('.value span:nth-child(1)::text').get(),
            'state': response.css('.value span+ span::text').get(),
            'job type': response.css('.row:nth-child(4) .value::text').get(),
        }
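As a side note, the same None problems can also be handled without try/except by checking the value before building the request. This is just a sketch of that alternative parse loop, reusing the selectors above and assuming it is acceptable to simply skip a post whose URL comes back empty:
    def parse(self, response):
        # one grouped selector matches either layout, so neither class is missed
        for post in response.css('.job-listings, .job-listings-old'):
            url = post.css('.job-url::text').get()
            title = post.css('.job-title::text').get()
            if url:  # skip posts whose layout returned no URL
                yield scrapy.Request(url=url, callback=self.parse_listing,
                                     meta={'url': url, 'title': title})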
Upvotes: 1