Reputation: 458
I am trying to save schedules of basketball teams to a CSV file using Scrapy. I have written the following code in these files:
BOT_NAME = 'test_project'
SPIDER_MODULES = ['test_project.spiders']
NEWSPIDER_MODULE = 'test_project.spiders'
FEED_FORMAT = "csv"
FEED_URI = "cportboys.csv"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'test_project (+'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True

import scrapy


class KhsaabotSpider(scrapy.Spider):
    name = 'khsaabot'
    allowed_domains = ['']
    start_urls = ['http://']

    def parse(self, response):
        # Each selector returns a list of strings, one entry per schedule row.
        date = response.css('.mdate::text').extract()
        opponent = response.css('.opponent::text').extract()
        place = response.css('.schedule-loc::text').extract()
        # Pair the three lists row by row and yield one item per game.
        for item in zip(date, opponent, place):
            scraped_info = {
                'date': item[0],
                'opponent': item[1],
                'place': item[2],
            }
            yield scraped_info
Now, I am not sure what is going wrong here. When I run it in the terminal using "scrapy crawl khsaabot", it gives no errors and appears to be working just fine. However, in case there is a problem with what is happening in the terminal, I have included the output I got there too:
2017-12-27 17:21:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: test_project)
2017-12-27 17:21:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'test_project', 'FEED_FORMAT': 'csv', 'FEED_URI': 'cportboys.csv', 'NEWSPIDER_MODULE': 'test_project.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['test_project.spiders']}
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled extensions:
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled spider middlewares:
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled item pipelines:
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Spider opened
2017-12-27 17:21:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-27 17:21:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://https/robots.txt>: DNS lookup failed: no results for hostname lookup: https.
Traceback (most recent call last):
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/python/", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/scrapy/core/downloader/", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https//> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//>
Traceback (most recent call last):
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/python/", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/scrapy/core/downloader/", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-27 17:21:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 6,
'downloader/request_bytes': 1416,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 12, 27, 23, 21, 49, 579649),
'log_count/DEBUG': 7,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'memusage/max': 50790400,
'memusage/startup': 50790400,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.internet.error.DNSLookupError': 4,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2017, 12, 27, 23, 21, 49, 323652)}
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Spider closed (finished)
The output looks right to me, but I am still new to Scrapy so I could be missing something.
Thanks y'all
Upvotes: 0
Views: 934
Reputation: 2198
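You are getting twisted.internet.error.DNSLookupError messages in the log, so the run did not actually succeed: every request failed and nothing was scraped. Looking at your start_urls list, the item starts with "http://https://", which makes Scrapy treat "https" as the hostname (note the "no results for hostname lookup: https" lines). Change:
start_urls = ['http://']
to:
start_urls = ['']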
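
To see why the lookup is attempted for "https", here is a minimal sketch using Python's standard urllib.parse. The example.com URLs are hypothetical stand-ins (the real address is not shown here), and hostname_of is just an illustrative helper name. Scrapy's own URL handling may differ in detail, but the log shows the same outcome.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the hostname a client would try to resolve for this URL."""
    return urlsplit(url).hostname


# Hypothetical URLs for illustration only; example.com is NOT the site
# from the question.
doubled = "http://https://example.com/schedule"
fixed = "https://example.com/schedule"

# With the doubled scheme, everything between "http://" and the next "/"
# is taken as the network location, i.e. "https:", so the host is "https".
print(hostname_of(doubled))  # https
print(hostname_of(fixed))    # example.com
```

A quick check like this on each entry in start_urls before running the crawl can catch a doubled scheme early.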
Upvotes: 5