Reputation: 739
I'm new to Scrapy, and it's an amazing crawler framework!
In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics but no details.
2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
'downloader/request_bytes': 46282582,
'downloader/request_count': 92383,
'downloader/request_method_count/GET': 92383,
'downloader/response_bytes': 123766459,
'downloader/response_count': 92382,
'downloader/response_status_count/200': 92382,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
'item_scraped_count': 46191,
'request_depth_max': 1,
'scheduler/memory_enqueued': 92383,
'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}
Is there any way to get a more detailed report? For example, showing those failed URLs. Thanks!
Upvotes: 53
Views: 41445
Reputation: 7889
Yes, this is possible.
The example below adds a failed_urls list to a basic spider class and appends URLs to it if the response status of the URL is 404 (this would need to be extended to cover other error statuses as required).
from scrapy import Spider, signals
class MySpider(Spider):
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    # Note: process_exception is a downloader-middleware hook; Scrapy calls it
    # on downloader middlewares, not on spiders, so move it there if you need it.
    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown; I simulated them by trying to run the spider after I'd turned off my wireless adapter):
2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 15,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
'downloader/request_bytes': 717,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 15209,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 2,
'failed_url_count': 2,
'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
'log_count/DEBUG': 9,
'log_count/ERROR': 2,
'log_count/INFO': 4,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'spider_exceptions/NameError': 2,
'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
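As a variation (a sketch only, not part of the original answer), you could also persist the collected list to a file when the spider closes, alongside the stats entry; the failed_urls.json filename is arbitrary:
import json

def handle_spider_closed(self, reason):
    self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))
    # additionally dump the list to disk so it survives beyond the stats log
    with open('failed_urls.json', 'w') as f:
        json.dump(self.failed_urls, f, indent=2)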
Upvotes: 59
Reputation: 462
Scrapy ignores 404 responses by default and does not parse them. If you are getting a 404 error code in the response, you can handle it in a very easy way.
In settings.py, write:
HTTPERROR_ALLOWED_CODES = [404, 403]
And then handle the response status code in your parse function:
def parse(self, response):
    if response.status == 404:
        # your action on error
        pass
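If all you need is to see which URLs failed, a minimal sketch of that "action on error" could simply log them (the status tuple mirrors the HTTPERROR_ALLOWED_CODES setting above; self.logger is the standard spider logger):
def parse(self, response):
    if response.status in (403, 404):
        # record the failed URL; this could also be pushed into crawler stats or an item
        self.logger.warning('Failed URL (%s): %s', response.status, response.url)
        return
    # ... normal parsing of successful responses ...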
Upvotes: 19
Reputation: 31
Basically, Scrapy ignores 404 errors by default; this is defined in the HttpError middleware.
So, add HTTPERROR_ALLOW_ALL = True to your settings file.
After this you can access response.status in your parse function, and you can handle it like this:
def parse(self, response):
    if response.status == 404:
        print(response.status)
    else:
        # do something
        pass
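If you prefer not to change settings.py globally, here is a sketch of scoping this to one spider via the standard custom_settings attribute:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    # applies HTTPERROR_ALLOW_ALL only to this spider
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        if response.status == 404:
            self.logger.warning("Got 404 for %s", response.url)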
Upvotes: 3
Reputation: 3661
The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError and twisted.web.http.PotentialDataLoss). These errors show up in the stats dump at the end of the run, but without any meta info.
As I found out here, the errors are tracked by the stats.py middleware, captured in the DownloaderStats class's process_exception method, and specifically in the ex_class variable, which increments each error type as necessary and then dumps the counts at the end of the run.
To match such errors with information from the corresponding request object, you can add a unique id to each request (via request.meta), then pull it into the process_exception method of stats.py:
self.stats.set_value('downloader/my_errs/{0}'.format(request.meta), ex_class)
That will generate a unique string for each downloader-based error not accompanied by a response. You can then save the altered stats.py as something else (e.g. my_stats.py), add it to the downloader middlewares (with the right precedence), and disable the stock stats.py:
DOWNLOADER_MIDDLEWARES = {
'myproject.my_stats.MyDownloaderStats': 850,
'scrapy.downloadermiddleware.stats.DownloaderStats': None,
}
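Rather than copying stats.py wholesale, an equivalent approach is to subclass the stock middleware (a sketch only; the import path shown is for recent Scrapy versions, and the 'request_id' meta key is just an illustrative name for whatever id you attach to your requests):
# my_stats.py
from scrapy.downloadermiddlewares.stats import DownloaderStats

class MyDownloaderStats(DownloaderStats):
    def process_exception(self, request, exception, spider):
        # keep the stock downloader/exception_* counters
        super().process_exception(request, exception, spider)
        ex_class = "%s.%s" % (exception.__class__.__module__,
                              exception.__class__.__name__)
        # map this request's id (from request.meta) to the exception type
        request_id = request.meta.get('request_id', request.url)
        self.stats.set_value('downloader/my_errs/{0}'.format(request_id),
                             ex_class, spider=spider)
Register it under DOWNLOADER_MIDDLEWARES as in the settings snippet above (with the stock entry set to None).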
The output at the end of the run looks like this (here using meta info where each request url is mapped to a group_id and member_id separated by a slash, like '0/14'):
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
'downloader/request_bytes': 47583,
'downloader/request_count': 133,
'downloader/request_method_count/GET': 133,
'downloader/response_bytes': 3416996,
'downloader/response_count': 130,
'downloader/response_status_count/200': 95,
'downloader/response_status_count/301': 24,
'downloader/response_status_count/302': 8,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished'....}
This answer deals with non-downloader-based errors.
Upvotes: 14
Reputation: 93
You can capture failed URLs in two ways.
1. Define the Scrapy Request with an errback:
class TestSpider(scrapy.Spider):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        '''handle failed url (failure.request.url)'''
        pass
2. Use signals.request_dropped:
class TestSpider(scrapy.Spider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # connect in from_crawler so the crawler object is available
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.request_dropped, signal=signals.request_dropped)
        return spider

    def request_dropped(self, request, spider):
        '''handle failed url (request.url)'''
        pass
Note: a Scrapy Request with an errback cannot catch some auto-retried failures, such as connection errors and the status codes listed in RETRY_HTTP_CODES in settings.
Upvotes: 4
Reputation: 385
In addition to some of these answers, if you want to track Twisted errors, I would take a look at using the Request object's errback parameter, on which you can set a callback function to be called with the Twisted Failure on a request failure. In addition to the URL, this method allows you to track the type of failure.
You can then log the URLs using failure.request.url (where failure is the Twisted Failure object passed into errback).
# these would be in a Spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse,
                             errback=self.handle_error)

def handle_error(self, failure):
    url = failure.request.url
    logging.error('Failure type: %s, URL: %s', failure.type, url)
The Scrapy docs give a full example of how this can be done, except that the calls to the Scrapy logger are now deprecated, so I've adapted my example to use Python's built-in logging:
https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
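For reference, here is a condensed sketch of the docs' errback pattern, branching on the failure type (the exception classes are the standard Twisted/Scrapy ones):
import logging

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

def handle_error(self, failure):
    if failure.check(HttpError):
        # a non-2xx response was received; it is attached to the failure
        response = failure.value.response
        logging.error('HttpError %s on %s', response.status, response.url)
    elif failure.check(DNSLookupError):
        logging.error('DNSLookupError on %s', failure.request.url)
    elif failure.check(TimeoutError, TCPTimedOutError):
        logging.error('TimeoutError on %s', failure.request.url)
    else:
        logging.error('Other failure: %s', repr(failure))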
Upvotes: 3
Reputation: 474003
Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field
class GitHubLinkItem(Item):
url = Field()
referer = Field()
status = Field()
class GithubHelpSpider(CrawlSpider):
name = "github_help"
allowed_domains = ["help.github.com"]
start_urls = ["https://help.github.com", ]
handle_httpstatus_list = [404]
rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)
def parse_item(self, response):
if response.status == 404:
item = GitHubLinkItem()
item['url'] = response.url
item['referer'] = response.request.headers.get('Referer')
item['status'] = response.status
return item
Just run scrapy runspider with -o output.json and see the list of items in the output.json file.
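For example (assuming the spider above is saved as github_help_spider.py; the filename is just illustrative):
scrapy runspider github_help_spider.py -o output.json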
Upvotes: 20
Reputation: 163
This is an update on this question. I ran into a similar problem and needed to use Scrapy signals to call a function in my pipeline. I have edited @Talvalin's code, but wanted to write an answer just for some more clarity.
Basically, you should add self as an argument for handle_spider_closed. You should also connect the dispatcher in __init__ so that you can pass the spider instance (self) to the handling method.
from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(Spider):
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []
        # the dispatcher is now connected in __init__
        dispatcher.connect(self.handle_spider_closed, signals.spider_closed)

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason):  # added self
        self.crawler.stats.set_value('failed_urls', ','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
I hope this helps anyone with the same problem in the future.
Upvotes: 5
Reputation: 151441
As of Scrapy 0.24.6, the method suggested by alecxe won't catch errors with the start URLs. To record errors with the start URLs you need to override parse_start_url. Adapting alecxe's answer for this purpose, you'd get:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field
class GitHubLinkItem(Item):
url = Field()
referer = Field()
status = Field()
class GithubHelpSpider(CrawlSpider):
name = "github_help"
allowed_domains = ["help.github.com"]
start_urls = ["https://help.github.com", ]
handle_httpstatus_list = [404]
rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)
def parse_start_url(self, response):
return self.handle_response(response)
def parse_item(self, response):
return self.handle_response(response)
def handle_response(self, response):
if response.status == 404:
item = GitHubLinkItem()
item['url'] = response.url
item['referer'] = response.request.headers.get('Referer')
item['status'] = response.status
return item
Upvotes: 5