cesalomx

Reputation: 47

Crawled 0 pages, scraped 0 items ERROR / webscraping / SELENIUM

I've tried several things to understand why my spider is failing, but haven't succeeded. I've been stuck for days now and can't afford to keep putting this off. I only want to scrape the very first page; I'm not doing pagination yet. I'd highly appreciate your help :( This is my code:

import scrapy
from scrapy_selenium import SeleniumRequest



class HomesSpider(scrapy.Spider):
    name = 'homes'

    def parse(self, response):
        yield SeleniumRequest(
            url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-2/v1c1097l1021p2',
            wait_time=3,
            callback=self.parse
        )
    
    def parse(self, response):
        homes = response.xpath("//div[@class='viewport-contents']/div")
        for home in homes:
            yield{
                'price': home.xpath(".//span[@class='value wrapper']/span[@class='ad-price']/text()").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': home.xpath(".//div[@class='chiplets-inline-block surface-area']/text()").get()
            }

This is my settings.py file:

# Scrapy settings for real_state project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'real_state'

SPIDER_MODULES = ['real_state.spiders']
NEWSPIDER_MODULE = 'real_state.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'real_state (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'real_state.middlewares.RealStateSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'real_state.pipelines.RealStatePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which("C:\\Users\\Cesal\\projects\\real_state\\chromedriver.exe")
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

This is what I get in the terminal when I execute it:

(base) PS C:\Users\Cesal\projects\real_state\real_state\spiders> scrapy crawl homes
2021-11-03 13:02:58 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: real_state)
2021-11-03 13:02:58 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 21.2.0, Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.19041-SP0
2021-11-03 13:02:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-03 13:02:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'real_state',
 'NEWSPIDER_MODULE': 'real_state.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['real_state.spiders']}
2021-11-03 13:02:58 [scrapy.extensions.telnet] INFO: Telnet Password: ade49fc0492d5027
2021-11-03 13:02:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-11-03 13:02:59 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:64533/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "platformName": "any", "goog:chromeOptions": {"extensions": [], "args": ["-headless"]}}}, "desiredCapabilities": {"browserName": "chrome", "version": "", "platform": "ANY", "goog:chromeOptions": {"extensions": [], "args": ["-headless"]}}}
2021-11-03 13:02:59 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:64533

DevTools listening on ws://127.0.0.1:64541/devtools/browser/302904be-ca13-4464-a332-8d995cb55f44
2021-11-03 13:03:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:64533 "POST /session HTTP/1.1" 200 788
2021-11-03 13:03:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-03 13:03:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy_selenium.SeleniumMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-11-03 13:03:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-11-03 13:03:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-11-03 13:03:00 [scrapy.core.engine] INFO: Spider opened
2021-11-03 13:03:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-03 13:03:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-11-03 13:03:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-03 13:03:00 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:64533/session/5a6d8196d99d431b8b786f3f24688d84 {}
2021-11-03 13:03:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:64533 "DELETE /session/5a6d8196d99d431b8b786f3f24688d84 HTTP/1.1" 200 14
2021-11-03 13:03:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-11-03 13:03:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.005515,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 11, 3, 19, 3, 0, 889623),
 'log_count/DEBUG': 7,
 'log_count/INFO': 10,
 'start_time': datetime.datetime(2021, 11, 3, 19, 3, 0, 884108)}
2021-11-03 13:03:02 [scrapy.core.engine] INFO: Spider closed (finished)
(base) PS C:\Users\Cesal\projects\real_state\real_state\spiders>

Upvotes: -1

Views: 104

Answers (1)

ShoGinn

Reputation: 48

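I think your error is that you are yielding the request from parse instead of from start_requests. Your class defines parse twice, so the second definition replaces the first and the SeleniumRequest is never created. Since the spider has neither start_urls nor a start_requests method, Scrapy has nothing to crawl and closes the spider immediately, which is why the log shows 0 pages and 0 items.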

Change:

def parse(self, response):
    yield SeleniumRequest(
        url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-2/v1c1097l1021p2',
        wait_time=3,
        callback=self.parse
    )

to:

def start_requests(self):
    yield SeleniumRequest(
        url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-2/v1c1097l1021p2',
        wait_time=3,
        callback=self.parse
    )
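
The name clash can be reproduced without Scrapy or Selenium. The snippet below is only a minimal sketch: DuplicateMethods is a made-up toy class, not part of your project, and the strings it returns are placeholders for what each of your two parse methods was meant to do.

```python
class DuplicateMethods:
    """Hypothetical toy class mirroring a spider with two `parse` definitions."""

    def parse(self, response):
        # First definition: stands in for the method that yielded the SeleniumRequest.
        return "first parse: would schedule the request"

    def parse(self, response):  # intentional redefinition of the same name
        # Second definition: rebinds `parse`, so the first one is discarded.
        return "second parse: extracts items"


demo = DuplicateMethods()
result = demo.parse(response=None)
print(result)

# The class namespace holds a single `parse` entry, the last one defined.
parse_count = list(vars(DuplicateMethods)).count("parse")
print(parse_count)
```

Python raises no error or warning for the duplicate, which is why the spider started and finished cleanly instead of failing with a traceback. Renaming the first method to start_requests, as shown above, gives each method its own name and gives Scrapy an initial request to schedule.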

Upvotes: 1
