Reputation: 11
I am new to Scrapy and am trying to scrape the title from the following website: https://www.mdcalc.com/heart-score-major-cardiac-events
I reviewed all the previous posts on this subject but am still getting the OpenSSL error.
Here is the relevant line from my settings.py:
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
Here is the code for my spider (note that allowed_domains must be a list, not a string):

import scrapy
from skitter.items import SkitterItem

class mdcalc(scrapy.Spider):
    name = "mdcalc"
    allowed_domains = ["mdcalc.com"]
    start_urls = ['https://www.mdcalc.com/heart-score-major-cardiac-events']

    def parse(self, response):
        item = SkitterItem()
        item['title'] = response.xpath('//h1//text()').extract()[0]
        yield item
When I run:
curl localhost:6800/schedule.json -d project=skitter -d spider=mdcalc
here is the error I get:
2017-09-27 02:02:23+0000 [scrapy] INFO: Scrapy 0.24.6 started (bot: skitter)
2017-09-27 02:02:23+0000 [scrapy] INFO: Optional features available: ssl,
http11
2017-09-27 02:02:23+0000 [scrapy] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'skitter.spiders', 'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES':
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled extensions: FeedExporter,
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled downloader middlewares:
RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware,
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware,
CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
UrlLengthMiddleware, DepthMiddleware
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled item pipelines:
ElasticSearchPipeline
2017-09-27 02:02:23+0000 [mdcalc] INFO: Spider opened
2017-09-27 02:02:23+0000 [mdcalc] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2017-09-27 02:02:23+0000 [scrapy] DEBUG: Telnet console listening on
127.0.0.1:6024
2017-09-27 02:02:23+0000 [scrapy] DEBUG: Web service listening on
127.0.0.1:6081
2017-09-27 02:02:23+0000 [mdcalc] DEBUG: Retrying <GET
https://www.mdcalc.com/robots.txt> (failed 1 times):
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:27+0000 [mdcalc] DEBUG: Retrying <GET
https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 1 times):
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:32+0000 [mdcalc] DEBUG: Retrying <GET
https://www.mdcalc.com/robots.txt> (failed 2 times):
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:38+0000 [mdcalc] DEBUG: Retrying <GET
https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 2 times):
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:45+0000 [mdcalc] DEBUG: Gave up retrying <GET
https://www.mdcalc.com/robots.txt> (failed 3 times):
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:45+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client]
ERROR: Unhandled error in Deferred:
2017-09-27 02:02:45+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client]
Unhandled Error
Traceback (most recent call last):
Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:52+0000 [mdcalc] DEBUG: Gave up retrying <GET https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:52+0000 [mdcalc] ERROR: Error downloading <GET https://www.mdcalc.com/heart-score-major-cardiac-events>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:52+0000 [mdcalc] INFO: Closing spider (finished)
2017-09-27 02:02:52+0000 [mdcalc] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived':
6,
'downloader/request_bytes': 1614,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 27, 2, 2, 52, 62313),
'log_count/DEBUG': 8,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2017, 9, 27, 2, 2, 23, 380740)}
2017-09-27 02:02:52+0000 [mdcalc] INFO: Spider closed (finished)
Thanks in advance for your help.
Upvotes: 1
Views: 2036
Reputation: 9
It's because of the version of Python that Scrapinghub Cloud runs by default, which is 2.7. To fix that, you have to specify that your spider must use Python 3. This link explains how to do it: https://support.scrapinghub.com/support/solutions/articles/22000200387-deploying-python-3-spiders-to-scrapy-cloud
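For example, the stack can be pinned in a scrapinghub.yml file at the project root. A sketch only — the project ID 12345 is a placeholder, and you should check the link above for the stack names currently offered:

```yaml
# scrapinghub.yml - replace 12345 with your own project ID
projects:
  default: 12345
stacks:
  default: scrapy:1.4-py3
```

After adding this file, redeploying the project (e.g. with `shub deploy`) should pick up the Python 3 stack.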
Upvotes: 0