Reputation: 933
My intention is to scrape a few URLs using a spider such as the following:
import scrapy
from ..items import ContentsPageSFBItem


class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    # allowed_domains = ["web"]
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
        'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
    ]

    def parse(self, response):
        item = ContentsPageSFBItem()
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
        for content_item in content_items:
            item['content_item'] = content_item
            item['full_url'] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()
            yield item
I intend to use more URLs. My intention is to create a restartable spider in case something goes wrong. My plan is to add exception handling and to create a csv with the list of remaining URLs. Where exactly can I add this functionality?
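Roughly, the kind of thing I have in mind is the sketch below, where the spider keeps track of the URLs it has not finished yet and dumps them to a csv when it closes (remaining_urls.csv and the bookkeeping attribute are just placeholder names I made up):

    # sketch only: these methods would live inside BasicSpider
    def __init__(self, *args, **kwargs):
        super(BasicSpider, self).__init__(*args, **kwargs)
        # urls that have not been scraped successfully yet
        self.remaining = set(self.start_urls)

    def parse(self, response):
        try:
            item = ContentsPageSFBItem()
            content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
            for content_item in content_items:
                item['content_item'] = content_item
                item['full_url'] = response.url
                item['title'] = response.xpath('//title[1]/text()').extract()
                yield item
            self.remaining.discard(response.url)
        except Exception:
            pass  # leave the url in self.remaining so a later run can retry it

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; persist whatever is left
        with open('remaining_urls.csv', 'w') as f:
            for url in self.remaining:
                f.write(url + '\n')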
Upvotes: 0
Views: 59
Reputation: 33420
You could store the current URL on which the problem occurred and then pass it to a new scrapy.Request with the same parse function as the callback, so the crawl continues. You can check whether a telling message has been printed in the page being visited by looking at response.body; if something bad has happened, yield a new scrapy.Request, and if not, continue as normal. Maybe:
    def parse(self, response):
        current_url = response.request.url
        # response.body is bytes on Python 3; use response.text there
        if 'Some or none message in the body' in response.body:
            # dont_filter=True so the duplicate filter does not drop the retry of the same url
            yield scrapy.Request(current_url, callback=self.parse, dont_filter=True)
        else:
            item = ContentsPageSFBItem()
            content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
            for content_item in content_items:
                item['content_item'] = content_item
                item['full_url'] = response.url
                item['title'] = response.xpath('//title[1]/text()').extract()
                yield item
Note that how you reuse the parse function depends heavily on which "exception" you want to catch.
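For instance, if the failure is a download or HTTP error rather than something visible in the page body, an errback on the request is probably the better hook; a minimal sketch (log_failed_url and failed_urls.csv are illustrative names, not part of your code):

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.log_failed_url)

    def log_failed_url(self, failure):
        # failure.request is the Request that failed (DNS error, timeout, HTTP error, ...)
        with open('failed_urls.csv', 'a') as f:
            f.write(failure.request.url + '\n')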
Keeping in mind that you want to write the data to different files depending on the URL you are on, I've tweaked the code a bit:
First, create three global variables to store the first and second URL, plus the fields as a list. Note that this is fine for these two URLs, but it becomes awkward as the list grows (one alternative is sketched after the snippet):
global first_url, second_url, fields
fields = []
first_url = 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/'
second_url = 'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
start_urls = [first_url, second_url]
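If the list of URLs keeps growing, one way to avoid a global per URL is to read the start URLs from an external file instead; a minimal sketch, assuming a urls.txt next to the spider with one URL per line:

    # assumption: urls.txt holds one url per line
    with open('urls.txt') as f:
        start_urls = [line.strip() for line in f if line.strip()]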
Then, within your parse function, you get the data and store it in the fields list, which is passed to a second function, parse_and_write_csv, that creates and writes to a separate file depending on the current URL.
    def parse(self, response):
        item = ContentsPageSFBItem()
        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
        url = response.request.url

        for content_item in content_items:
            item['content_item'] = content_item
            item['full_url'] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()

            fields = [item['content_item'].encode('utf-8'), item['full_url'], item['title'][0]]
            self.parse_and_write_csv(response, fields)
parse_and_write_csv takes the fields and, depending on the URL, uses the element at index 5 of the URL split on '/' as the file name, creating the csv file or opening it for appending if it already exists.
    def parse_and_write_csv(self, response, fields):
        with open("%s.csv" % response.request.url.split('/')[5], 'a+') as file:
            file.write("{}\n".format(';'.join(str(field) for field in fields)))
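To make the naming concrete: for the two start URLs above, the element at index 5 of the split URL is the book slug, so each book ends up in its own file. For example:

    url = 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/'
    print(url.split('/')[5])  # shell-programming-in -> rows go to shell-programming-in.csv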
Hope it helps. You can see a gist here.
Upvotes: 1