Reputation: 36307
I'm trying out the scrapy-selenium package: https://github.com/clemfromspace/scrapy-selenium.
I followed the directions on the project's GitHub page, started a new Scrapy project, and created this spider:
import scrapy
from scrapy_selenium import SeleniumRequest
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
# SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0'


class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'test'

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url, self.parse_index_page)

    def parse_index_page(self, response):
        ....
The output:
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-07-05 14:14:44 [scrapy.middleware] WARNING: Disabled SeleniumMiddleware: SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set
2019-07-05 14:14:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
........
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-05 14:14:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-05 14:14:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-05 14:14:44 [scrapy.core.engine] INFO: Spider opened
2019-07-05 14:14:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-05 14:14:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-05 14:14:44 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "....\splashtest\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "E:\ENVS\r3\scrapySelenium\scrapySelenium\spiders\test.py", line 55, in start_requests
yield SeleniumRequest(url, self.parse_index_page)
File "....\splashtest\lib\site-packages\scrapy_selenium\http.py", line 32, in __init__
super().__init__(*args, **kwargs)
TypeError: __init__() missing 1 required positional argument: 'url'
2019-07-05 14:14:44 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-05 14:14:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 7, 5, 18, 14, 44, 74274),
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'start_time': datetime.datetime(2019, 7, 5, 18, 14, 44, 66256)}
2019-07-05 14:14:44 [scrapy.core.engine] INFO: Spider closed (finished)
What am I doing wrong?
Upvotes: 1
Views: 1294
Reputation: 1711
From the link in the question, the scrapy_selenium.SeleniumRequest constructor takes initial arguments wait_time, wait_until, screenshot, and script, passing any remaining arguments along to the scrapy.Request constructor.
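For reference, the constructor in scrapy_selenium/http.py (the file named in your traceback) looks roughly like the following; this is a paraphrase of the linked repository, so names of internals and default values may differ slightly between versions:

from scrapy import Request

class SeleniumRequest(Request):
    """A scrapy.Request subclass processed by the Selenium middleware."""

    def __init__(self, wait_time=None, wait_until=None, screenshot=False, script=None, *args, **kwargs):
        # The Selenium-specific options are consumed by the first four parameters...
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script
        # ...and only what is left in *args/**kwargs reaches scrapy.Request,
        # whose constructor requires a url.
        super().__init__(*args, **kwargs)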
The posted code instantiates a SeleniumRequest with two positional arguments, so they are bound to wait_time and wait_until; nothing is left to pass along to the Request constructor, which is why its required url argument is reported as missing.
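To make the binding concrete (an illustration of the call from the question):

SeleniumRequest(url, self.parse_index_page)
#   wait_time  = url                    (first positional slot)
#   wait_until = self.parse_index_page  (second positional slot)
#   *args and **kwargs are empty, so Request.__init__() receives no url
#   -> TypeError: __init__() missing 1 required positional argument: 'url'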
To fix this, you can either explicitly pass the default SeleniumRequest arguments by position:

yield SeleniumRequest(None, None, False, None, url, self.parse_index_page)
or pass the Request arguments by keyword (generally a better style, both because it's resilient against changes to the default arguments and because it's far easier to determine what's going on):

yield SeleniumRequest(url=url, callback=self.parse_index_page)
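Applied to the spider from the question, start_requests would then look like this (a sketch; parse_index_page is whatever callback you already have):

def start_requests(self):
    for url in self.start_urls:
        # Pass the Request arguments by keyword so they are not swallowed
        # by SeleniumRequest's own wait_time/wait_until parameters.
        yield SeleniumRequest(url=url, callback=self.parse_index_page)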
Upvotes: 2