Reputation: 53
I set up a spider very similar to the example in the Scrapy tutorial.
I want the spider to crawl all of the quotes BEFORE going to the next page. I also want it to parse only 1 quote per second. So if there were 20 quotes on a page, it would take 20 seconds to scrape the quotes, then 1 second to go to the next page.
As it stands, my current implementation iterates through every page before actually getting the quote information.
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
Here are the basics of my settings.py file:
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 2
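If I read the docs correctly, a fixed pace of one request per second would look something like the sketch below (by default Scrapy randomizes the delay between 0.5x and 1.5x of DOWNLOAD_DELAY, so RANDOMIZE_DOWNLOAD_DELAY is disabled here):

# sketch: aiming for roughly one request per second
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = False  # keep the delay fixed instead of 0.5x-1.5x of DOWNLOAD_DELAY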
Upvotes: 2
Views: 182
Reputation: 394
You could orchestrate how the scrapy.Request objects are yielded.
For example, you could create the next-page Request, but only yield it once all of the author Requests have finished scraping their items.
Example:
import scrapy

# Tracks which author pages are still pending (keyed by absolute URL)
pending_authors = {}


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # process pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        next_page_request = None
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # Create the Request object, but do not yield it yet
            next_page_request = scrapy.Request(next_page, callback=self.parse)

        # Request scraping of the authors, passing along a reference to the next-page Request
        for href in response.css('.author+a::attr(href)').extract():
            author_url = response.urljoin(href)
            pending_authors[author_url] = False  # mark this author as 'not processed'
            yield scrapy.Request(author_url, callback=self.parse_author,
                                 meta={'next_page_request': next_page_request})

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        item = {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

        # mark this author as 'processed' (the keys are absolute URLs, so they match response.url)
        pending_authors[response.url] = True

        # check whether every author has been processed
        if all(pending_authors.values()):
            yield item

            # request the next page only after finishing all authors
            next_page_request = response.meta['next_page_request']
            if next_page_request is not None:
                yield next_page_request
        else:
            yield item
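As an alternative sketch (untested, assuming the default scheduler and your CONCURRENT_REQUESTS = 1), you could also keep the original spider and give the author Requests a higher priority than the pagination Request. Scrapy's Request accepts a priority argument, and the scheduler dequeues higher-priority requests first, so all authors of the current page should be fetched before the next page is requested:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # author requests get a higher priority than the pagination request,
        # so the scheduler hands them out first
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author,
                                 priority=10)

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse,
                                 priority=0)

    def parse_author(self, response):
        # same extraction as in the question
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }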
Upvotes: 1