Teodora

Reputation: 21

Scrapy no such host crawler

I'm using this crawler as my base crawler https://github.com/alecxe/broken-links-checker/blob/master/broken_links_spider.py

It is designed to catch domains returning 404 errors and save them. I wanted to modify it a bit and make it look for the "No such host" error, which is error 12002.

However, with this code, Scrapy does not receive any response (because there is no host to return one), and when Scrapy encounters such domains it returns

not found: [Errno 11001] getaddrinfo failed.

How can I catch this not found error and save the domains?

Upvotes: 2

Views: 152

Answers (2)

Rejected

Reputation: 4491

Exceptions occurring during the processing of a request pass through the Downloader Middleware like Request and Response objects do, and are handled via the process_exception() method.

The following would log all exceptions (including when an IgnoreRequest is raised) to a log file:

class ExceptionLog(object):

    def process_exception(self, request, exception, spider):
        # Append every downloader exception (e.g. DNS lookup failures) to a file
        with open('exceptions.log', 'a') as f:
            f.write(str(exception) + "\n")

Expand it to use signals to hook into the usual spider_opened() and spider_closed() methods for better file handling, or to pass settings in from your settings.py file (such as a custom EXCEPTIONS_LOG = ...).
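
A minimal sketch of that expanded version might look like the following (the EXCEPTIONS_LOG setting name comes from above; the default filename and the tab-separated log format are assumptions):

from scrapy import signals


class ExceptionLog(object):
    """Logs every downloader exception to a single file kept open for the crawl."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # EXCEPTIONS_LOG is the custom setting mentioned above; the default
        # filename here is just an assumption
        middleware = cls(crawler.settings.get('EXCEPTIONS_LOG', 'exceptions.log'))
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        self.file = open(self.log_path, 'a')

    def spider_closed(self, spider):
        if self.file:
            self.file.close()

    def process_exception(self, request, exception, spider):
        # Record the failing URL together with the exception text
        self.file.write("%s\t%s\n" % (request.url, exception))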

Add this to your DOWNLOADER_MIDDLEWARES dictionary in your settings file. Be mindful of where you put it in the chain of middleware, though! Too close to the engine, and you may miss logging exceptions handled elsewhere. Too far from the engine, and you may log exceptions that are retried or otherwise resolved. Where you put it depends on where you need it.
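
As a rough illustration (the module path myproject.middlewares and the order value 543 are placeholders you would adjust for your project):

# settings.py
EXCEPTIONS_LOG = 'exceptions.log'

DOWNLOADER_MIDDLEWARES = {
    # The number controls the middleware's position in the chain; pick it
    # relative to the built-in downloader middlewares depending on whether
    # you want to see exceptions before or after they are retried/handled.
    'myproject.middlewares.ExceptionLog': 543,
}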

Upvotes: 1

adam-asdf

Reputation: 656

This isn't a very elegant solution (it requires manual work), but it worked for me, so let me mention it.

I used Scrapy to gather the links I wanted to check.

I then took that scraped data (in a CSV), opened it in Sublime Text, and sanitized it (converted everything to lower case, removed malformed URLs, etc.). I saved that file as plain text (.txt) and de-duplicated it with sort from a Bash shell: $ sort -u my-list-of-link.txt. I then created another spider with those URLs listed as the start_urls.
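
A minimal sketch of that second spider, assuming the sorted file from the previous step and reading the URLs at runtime rather than hard-coding them (the spider name is hypothetical), could look like this:

import scrapy


class RecheckSpider(scrapy.Spider):
    # Hypothetical name; this spider exists only to generate log output
    name = "recheck"

    def start_requests(self):
        # Read the de-duplicated list produced by `sort -u`
        with open("my-list-of-link.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Nothing to extract; only the failures reported in the log matter here
        pass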

I ran that spider and when it finished, I copied and pasted the logging output from my shell into a new file in Sublime Text. Then I did a 'find all' on the error code that interested me.

With all instances of the error code selected, I expanded the selections to the entire lines and copied and pasted them into another plain text file, which amounted to a list of all the links/domains that returned that error code.

Upvotes: 0
