gr3yh47
gr3yh47

Reputation: 51

Scrapy twisted connection lost in non-clean fashion. No proxy. Already tried headers

I am trying to crawl this site

https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs

with scrapy and keep getting twisted request/disconnection errors. I am not using a proxy and I tried both setting the user agent and actually setting all the headers based on this answer

here is the code generating the request

def start_requests(self):
    url = 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs'

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'DNT': '1',
        'Host': 'www5.apply2jobs.com',
        'Referer': 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=2524&CurrentPage=2',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'
    }

    yield Request(url=url, headers=headers, callback=self.parse)

and this is my traceback:

2017-08-28 13:34:13 [scrapy.core.engine] INFO: Spider opened
2017-08-28 13:34:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-28 13:34:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www5.apply2jobs.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.scraper] ERROR: Error downloading <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.engine] INFO: Closing spider (finished)

Upvotes: 2

Views: 15644

Answers (1)

gr3yh47
gr3yh47

Reputation: 51

So thanks to discussion on github as well as in the comments to my question, it looks like the best course of action is to use a virtualenv with cryptography<2

credit to @paultrmbrth for helping so much

I tried compiling OpenSSL 1.1.0f with enable-weak-ssl-ciphers to build a static wheel, but I didn't manage to have it support TLS_RSA_WITH_RC4_128_MD5 (as ssllabs.com reports) for some reason. I'm laking OpenSSL building knowledge apparently. So the only option I see is to use a virtualenv with 'cryptography<2' for scraping that website.

Upvotes: 3

Related Questions