Reputation: 241
I am trying to crawl multiple websites using Scrapy's link extractor with follow=True (recursive crawling). I am looking for a way to set a time limit on the crawl of each URL in the start_urls list.
Thanks
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
Upvotes: 2
Views: 7362
Reputation: 21406
You need to use the download_timeout meta key of scrapy.Request.
To apply it to the start URLs, override the spider's start_requests() method, something like:
from scrapy import Request

def start_requests(self):
    # 10 seconds for the first url
    yield Request(self.start_urls[0], meta={'download_timeout': 10})
    # 60 seconds for the second url
    yield Request(self.start_urls[1], meta={'download_timeout': 60})
You can read more about Request's special meta keys here: http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys
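The meta wiring itself is plain Python, independent of Scrapy. Here is a small hypothetical helper (per_url_meta and the timeout values are my own illustration, not part of the Scrapy API) that pairs each start URL with its own meta dict, with a fallback for URLs you did not list a timeout for:

```python
def per_url_meta(start_urls, timeouts, default=30):
    """Pair each start URL with a {'download_timeout': ...} meta dict.

    URLs beyond the end of `timeouts` fall back to `default` seconds.
    """
    metas = {}
    for i, url in enumerate(start_urls):
        seconds = timeouts[i] if i < len(timeouts) else default
        metas[url] = {"download_timeout": seconds}
    return metas

urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]
metas = per_url_meta(urls, [10, 60])
print(metas[urls[0]])  # {'download_timeout': 10}
print(metas[urls[1]])  # {'download_timeout': 60}
```

Inside start_requests() you would then yield Request(url, meta=metas[url]) for each URL.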
Upvotes: 5
Reputation: 766
You can use the CLOSESPIDER_TIMEOUT
setting. Note that it closes the whole spider after the given number of seconds, not each URL individually.
For example, call your spider like this (using the spider's name, not the class name):
scrapy crawl dmoz -s CLOSESPIDER_TIMEOUT=10
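Alternatively, the same setting can live on the spider itself via custom_settings, so you do not have to pass -s on every run (a config sketch; the 10-second value is illustrative):

```python
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Close the whole spider 10 seconds after it opens (illustrative value)
    custom_settings = {"CLOSESPIDER_TIMEOUT": 10}
```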
Upvotes: 0
Reputation: 10779
Use a Timeout object!
import signal

class Timeout(object):
    """Timeout class using ALARM signal."""

    class TimeoutError(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable alarm

    def raise_timeout(self, *args):
        raise Timeout.TimeoutError('TimeoutError')
Then you can run your extraction code inside a with statement like this:
try:
    with Timeout(10):  # 10 seconds
        do_what_you_need_to_do()
except Timeout.TimeoutError:
    pass  # break, continue, or whatever else you may need
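To see the alarm actually fire, here is a condensed, self-contained version of the class above, exercised with time.sleep standing in for a slow crawl. Note that SIGALRM is Unix-only (it does not exist on Windows) and must be used from the main thread:

```python
import signal
import time

class Timeout(object):
    """Context manager that raises after `sec` seconds via SIGALRM (Unix-only)."""

    class TimeoutError(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable the pending alarm on exit

    def raise_timeout(self, *args):
        raise Timeout.TimeoutError('TimeoutError')

# time.sleep stands in for crawling one URL that takes too long
timed_out = False
try:
    with Timeout(1):  # allow at most 1 second
        time.sleep(3)  # interrupted by SIGALRM after ~1 second
except Timeout.TimeoutError:
    timed_out = True

print(timed_out)  # True
```

The try wraps the with block (rather than sitting inside it) so that __exit__ has already cancelled the alarm by the time the handler's exception is caught.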
Upvotes: -2