kévin Goncalves

Reputation: 27

Scrapy-Selenium returns an empty html body

I am running a scrapy-selenium script, but when I print the page source it is actually empty. I don't get a 403 error or any other error.

spider.py:

import scrapy
import random
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium import webdriver


USERNAME = '******'
PASSWORD = '*******'


class ApiPbSpider(scrapy.Spider):
    name = 'api_pb'

    def new_proxy(self):
        sessionid = f"session{random.randint(1, 100)}"
        self.proxy = ('https://customer*****************************' % (USERNAME, sessionid, PASSWORD))

    def start_requests(self):
        self.new_proxy()

        yield SeleniumRequest(
            url='https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=marie&ou=Civray+%2886400%29&univers=pagesblanches&idOu=L08607800',
            wait_time=15,
            callback=self.parse,
        )

    def parse(self, response):
        driver = response.meta['driver']
        code_page = driver.page_source
        print(code_page)

In my settings I added some options that helped me bypass the 403 error and get a 200 response. settings.py:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# 'proxy' must be defined earlier in settings.py for the f-string below to resolve
SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',  # '--headless' if using Chrome instead of Firefox
    '--incognito',
    '--nogpu',
    '--disable-gpu',
    '--window-size=1280,1280',
    '--no-sandbox',
    '--enable-javascript',
    '--disable-blink-features=AutomationControlled',
    f'--proxy-server={proxy}',
]

So I get an empty body:

    DevTools listening on ws://127.0.0.1:57690/devtools/browser/377c991f-9c34-406d-a05c-aa01a4198bfe
2024-04-25 09:56:38 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "POST /session HTTP/1.1" 200 895
2024-04-25 09:56:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2024-04-25 09:56:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy_selenium.SeleniumMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-25 09:56:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-25 09:56:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-25 09:56:38 [scrapy.core.engine] INFO: Spider opened
2024-04-25 09:56:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-25 09:56:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:57684/session/a420b263b967f767ba30a84393aa8e12/url {"url": "https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=marie&ou=Civray+%2886400%29&univers=pagesblanches&idOu=L08607800"}
2024-04-25 09:56:39 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "POST /session/a420b263b967f767ba30a84393aa8e12/url HTTP/1.1" 200 14
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:57684/session/a420b263b967f767ba30a84393aa8e12/source {}
2024-04-25 09:56:39 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "GET /session/a420b263b967f767ba30a84393aa8e12/source HTTP/1.1" 200 81
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:57684/session/a420b263b967f767ba30a84393aa8e12/url {}
2024-04-25 09:56:39 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "GET /session/a420b263b967f767ba30a84393aa8e12/url HTTP/1.1" 200 135
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2024-04-25 09:56:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=marie&ou=Civray+%2886400%29&univers=pagesblanches&idOu=L08607800> (referer: https://www.pagesjaunes.fr/pagesblanches/)
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:57684/session/a420b263b967f767ba30a84393aa8e12/source {}
2024-04-25 09:56:39 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "GET /session/a420b263b967f767ba30a84393aa8e12/source HTTP/1.1" 200 81
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
<html><head></head><body></body></html>
2024-04-25 09:56:39 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:57684/session/a420b263b967f767ba30a84393aa8e12 {}
2024-04-25 09:56:39 [urllib3.connectionpool] DEBUG: http://127.0.0.1:57684 "DELETE /session/a420b263b967f767ba30a84393aa8e12 HTTP/1.1" 200 14
2024-04-25 09:56:39 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2024-04-25 09:56:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 58,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.295228,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 4, 25, 7, 56, 39, 778231, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 21,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 4, 25, 7, 56, 39, 483003, tzinfo=datetime.timezone.utc)}
2024-04-25 09:56:41 [scrapy.core.engine] INFO: Spider closed (finished)

Also, I tried adding desired_capabilities such as "acceptInsecureCerts", but that did not work either.

Upvotes: 0

Views: 61

Answers (0)
