Malte Susen

Reputation: 845

Python Scraping from URL List

I am trying to open a CSV file containing a number of URLs that I want to scrape. However, when I run the request, I only receive a blank document as a result. When I enter the URLs directly into the Python code, everything works fine.

My suspicion is that something is wrong with the CSV file, as the code seems to be in line with what has worked for other users.

The CSV file, which is saved in the same folder as the scraper, is currently formatted as follows:

'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws', 'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016&tbm=nws', 'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2017%2Ccd_max%3A12%2F31%2F2017&tbm=nws',

Does anyone have an idea what could be the reason for the blank document I am receiving? Please also find below the Python code I am using:

import scrapy

class TermSpider(scrapy.Spider):
    name = 'TermCheck'
    allowed_domains = ['google.com']

with open('urls.csv') as file:
    start_urls = [line.strip() for line in file]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item

Please find below the run trace:

2019-02-11 15:30:19 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-02-11 15:30:19 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Darwin-18.0.0-x86_64-i386-64bit
2019-02-11 15:30:19 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'quote7.csv', 'SPIDER_LOADER_WARN_ONLY': True}
2019-02-11 15:30:19 [scrapy.extensions.telnet] INFO: Telnet Password: 9e547af3b30153c7
2019-02-11 15:30:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-02-11 15:30:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-11 15:30:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-11 15:30:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-11 15:30:19 [scrapy.core.engine] INFO: Spider opened
2019-02-11 15:30:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-11 15:30:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-11 15:30:19 [scrapy.core.scraper] ERROR: Error downloading <GET 'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2015%2Ccd_max%3A12%2F31%2F2015&tbm=nws'>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 70, in download_request
    (scheme, self._notconfigured[scheme]))
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2019-02-11 15:30:19 [scrapy.core.scraper] ERROR: Error downloading <GET 'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2016%2Ccd_max%3A12%2F31%2F2016&tbm=nws'>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 70, in download_request
    (scheme, self._notconfigured[scheme]))
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2019-02-11 15:30:19 [scrapy.core.scraper] ERROR: Error downloading <GET 'https://www.google.com/search?q=elon+musk&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2017%2Ccd_max%3A12%2F31%2F2017&tbm=nws'>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 70, in download_request
    (scheme, self._notconfigured[scheme]))
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2019-02-11 15:30:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-11 15:30:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/scrapy.exceptions.NotSupported': 3,
 'downloader/request_bytes': 966,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 11, 15, 30, 19, 829530),
 'log_count/ERROR': 3,
 'log_count/INFO': 9,
 'memusage/max': 50180096,
 'memusage/startup': 50180096,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2019, 2, 11, 15, 30, 19, 485355)}
2019-02-11 15:30:19 [scrapy.core.engine] INFO: Spider closed (finished)

Upvotes: 0

Views: 203

Answers (1)

vezunchik

Reputation: 3717

Check the indentation in your class: as posted, the with open(...) block sits at module level outside the class body, and your parse method is indented inside that block, so the spider class ends up with neither start_urls nor parse. Refactor the file reading into start_requests like this:

import scrapy

class TermSpider(scrapy.Spider):
    name = 'TermCheck'
    allowed_domains = ['google.com']

    def start_requests(self):
        with open('urls.csv') as file:
            for line in file.readlines():
                yield scrapy.Request(line.strip())

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item
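For reference, the FEED_FORMAT and FEED_URI settings in your log suggest the spider is run with a CSV feed export; assuming the spider is saved as term_spider.py (the filename is an assumption), the equivalent command would be:

scrapy runspider term_spider.py -o quote7.csv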

Also check your CSV file and remove the quotes and commas, so the file contains one bare URL per line. As pasted, every line Scrapy reads begins with a quote character, so the URL has no recognizable scheme, which is exactly why the trace reports Unsupported URL scheme ''.
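Alternatively, if you would rather keep the file in its current form, here is a minimal sketch that parses it with the csv module instead of reading raw lines (this assumes the exact format shown in the question: single-quoted URLs separated by commas):

import csv
import scrapy

class TermSpider(scrapy.Spider):
    name = 'TermCheck'
    allowed_domains = ['google.com']

    def start_requests(self):
        # Parse the comma-separated file instead of treating each raw line as one URL
        with open('urls.csv', newline='') as file:
            for row in csv.reader(file, skipinitialspace=True):
                for field in row:
                    url = field.strip().strip("'\"")  # drop the surrounding quotes
                    if url:  # the trailing comma produces an empty field; skip it
                        yield scrapy.Request(url)

The parse method stays the same as in the snippet above.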

Upvotes: 3
