Reputation: 2126
UPD: I'm not closing the question because I don't think my approach is as clear as it should be.
Is it possible to get the current request, the response, and the download time so I can save them to an Item?
In "plain" Python I do:
    import urllib2
    from time import time
    start_time = time()
    urllib2.urlopen('http://example.com').read()
    download_time = time() - start_time
But how can I do this with Scrapy?
UPD:
This solution is good enough for me, but I'm not sure about the quality of the results. If many connections fail with timeout errors, the measured download time may be wrong (even as large as DOWNLOAD_TIMEOUT * 3).
In settings.py:
    DOWNLOADER_MIDDLEWARES = {
        'myscraper.middlewares.DownloadTimer': 0,
    }
In middlewares.py:
    from time import time
    from scrapy.http import Response

    class DownloadTimer(object):
        def process_request(self, request, spider):
            request.meta['__start_time'] = time()
            # Returning None does not block the middlewares with a higher order number
            return None

        def process_response(self, request, response, spider):
            request.meta['__end_time'] = time()
            # process_response has to return a response (or a request)
            return response

        def process_exception(self, request, exception, spider):
            request.meta['__end_time'] = time()
            # Return a placeholder response with an arbitrary status code (110)
            return Response(
                url=request.url,
                status=110,
                request=request)
Inside spider.py, in def parse(...):
    log.msg('Download time: %.2f - %.2f = %.2f' % (
        response.meta['__end_time'], response.meta['__start_time'],
        response.meta['__end_time'] - response.meta['__start_time']
    ), level=log.DEBUG)
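To get these values into an Item rather than just the log, the two meta keys can be turned into plain fields. This is only a minimal sketch: `timing_fields` and the output field names are hypothetical, and it assumes the `__start_time` / `__end_time` keys written by the DownloadTimer middleware above.

```python
def timing_fields(meta):
    """Build timing fields from a request/response meta mapping.

    Assumes the '__start_time' and '__end_time' keys set by the
    DownloadTimer middleware; a missing key gives None, not KeyError.
    """
    start = meta.get('__start_time')
    end = meta.get('__end_time')
    if start is None or end is None:
        elapsed = None
    else:
        elapsed = end - start
    return {'start_time': start, 'end_time': end, 'download_time': elapsed}
```

In a callback this could be used as `item.update(timing_fields(response.meta))`, provided the item declares those three fields.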
Upvotes: 14
Views: 4263
Reputation: 1
I think the best solution is to use Scrapy signals. Whenever a request reaches the downloader, Scrapy emits the request_reached_downloader signal; once the download finishes, it emits the response_downloaded signal. You can catch both from the spider and store the timestamps and their difference in meta from there.
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        return spider
A more elaborate answer is here
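The snippet above connects item_scraped, so here is a rough sketch of what wiring up the two signals named in this answer might look like. The class name `DownloadTimingExtension` and the meta keys `download_started` / `download_elapsed` are made up for illustration, and it assumes a Scrapy version that provides `signals.request_reached_downloader`.

```python
import time


class DownloadTimingExtension:
    """Hypothetical extension that times each download via signals."""

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the handlers can be tried without Scrapy installed.
        from scrapy import signals

        ext = cls()
        crawler.signals.connect(ext.request_reached_downloader,
                                signal=signals.request_reached_downloader)
        crawler.signals.connect(ext.response_downloaded,
                                signal=signals.response_downloaded)
        return ext

    def request_reached_downloader(self, request, spider):
        # A monotonic clock is not affected by system clock adjustments.
        request.meta['download_started'] = time.monotonic()

    def response_downloaded(self, response, request, spider):
        started = request.meta.get('download_started')
        if started is not None:
            request.meta['download_elapsed'] = time.monotonic() - started
```

It would be enabled through the EXTENSIONS setting (for example under a path such as 'myscraper.extensions.DownloadTimingExtension', which is an assumed module layout), after which the callback can read `response.meta.get('download_elapsed')`.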
Upvotes: 0
Reputation: 360
I'm not sure you need a middleware here. Scrapy keeps per-request data in request.meta (also reachable as response.meta), which you can read in the callback and yield. For download latency, simply yield:
    download_latency=response.meta.get('download_latency'),
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
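As a small illustration of the above, a callback could assemble its output like this. `response_to_item` and the dictionary keys are hypothetical names; only the `download_latency` meta key comes from Scrapy itself.

```python
def response_to_item(response):
    """Collect the URL, status and download latency of a response.

    Uses .get() so a response without 'download_latency' in its meta
    yields None instead of raising KeyError.
    """
    return {
        'url': response.url,
        'status': response.status,
        'download_latency': response.meta.get('download_latency'),
    }
```

Inside a spider's parse method this would simply be `yield response_to_item(response)`.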
Upvotes: 8
Reputation: 2254
You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
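A minimal sketch of such a middleware might look like the following. The class name `ElapsedTimeMiddleware` and the meta keys `timer_start` / `elapsed_seconds` are assumptions for illustration, and the method signatures follow the ones used in the question.

```python
import time


class ElapsedTimeMiddleware:
    """Hypothetical downloader middleware that stores elapsed time in meta."""

    def process_request(self, request, spider):
        # Record the start on a monotonic clock; returning None lets
        # the remaining middlewares and the downloader proceed.
        request.meta['timer_start'] = time.monotonic()
        return None

    def process_response(self, request, response, spider):
        start = request.meta.get('timer_start')
        if start is not None:
            request.meta['elapsed_seconds'] = time.monotonic() - start
        return response
```

Note that the measured span depends on where the middleware sits in DOWNLOADER_MIDDLEWARES, since it covers everything that runs between its process_request and process_response calls.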
Upvotes: 7