Reputation: 1
I am trying to integrate Selenium with Scrapy to render JavaScript from a website. I have put the Selenium automation code in a constructor; it performs a button click, and then the parse function scrapes the data from the page. But the following errors appear in the terminal window. Code:
import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which

class test_2(scrapy.Spider):
    name = 'test_2'
    # allowed_domains = []
    start_urls = [
        'https://www.jackjones.in/st-search?q=shoes'
    ]

    def _init_(self):
        print("test-1")
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        driver = webdriver.Chrome("C:/chromedriver")
        driver.set_window(1920, 1080)
        driver.get("https://www.jackjones.in/st-search?q=shoes")
        tab = driver.find_elements_by_class_name("st-single-product")
        tab[4].click()
        self.html = driver.page_source
        print("test-2")
        driver.close()

    def parse(self, response):
        print("test-3")
        resp = Selector(text=self.html)
        yield {
            'title': resp.xpath("//h1/text()").get()
        }
It appears that the interpreter does not execute the init function before reaching the parse function: neither of the print statements in the constructor shows up in the output, while the print statement in the parse function does.
How do I fix this?
Output:
PS C:\Users\Vasu\summer\scrapy_selenium> scrapy crawl test_2
2022-07-01 13:18:30 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapy_selenium)
2022-07-01 13:18:30 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0,
w3lib 1.22.0, Twisted 22.4.0, Python 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL
22.0.0 (OpenSSL 1.1.1p 21 Jun 2022), cryptography 37.0.1, Platform Windows-10-10.0.19044-SP0
2022-07-01 13:18:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_selenium',
'NEWSPIDER_MODULE': 'scrapy_selenium.spiders',
'SPIDER_MODULES': ['scrapy_selenium.spiders']}
2022-07-01 13:18:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-07-01 13:18:30 [scrapy.extensions.telnet] INFO: Telnet Password: 168b57499cd07735
2022-07-01 13:18:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-07-01 13:18:31 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-01 13:18:31 [scrapy.core.engine] INFO: Spider opened
2022-07-01 13:18:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-01 13:18:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-01 13:18:31 [filelock] DEBUG: Attempting to acquire lock 1385261511056 on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:31 [filelock] DEBUG: Lock 1385261511056 acquired on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:32 [filelock] DEBUG: Attempting to release lock 1385261511056 on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:32 [filelock] DEBUG: Lock 1385261511056 released on C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-01 13:18:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jackjones.in/st-search?q=shoes> (referer: None)
test-3
2022-07-01 13:18:32 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.jackjones.in/st-search?q=shoes> (referer: None)
Traceback (most recent call last):
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\defer.py", line 132, in iter_errback
yield next(it)
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\python.py", line 354, in __next__
return next(self.data)
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\utils\python.py", line 354, in __next__
return next(self.data)
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Vasu\anaconda3\envs\sca_sel\lib\site-packages\scrapy\core\spidermw.py", line 66, in _evaluate_iterable
for r in iterable:
File "C:\Users\Vasu\summer\scrapy_selenium\scrapy_selenium\spiders\test_2.py", line 31, in parse
resp=Selector(text=self.html)
AttributeError: 'test_2' object has no attribute 'html'
2022-07-01 13:18:32 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-01 13:18:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 20430,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.613799,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 7, 1, 7, 48, 32, 202155),
'httpcompression/response_bytes': 87151,
'httpcompression/response_count': 1,
'log_count/DEBUG': 6,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2022, 7, 1, 7, 48, 31, 588356)}
2022-07-01 13:18:32 [scrapy.core.engine] INFO: Spider closed (finished)
Upvotes: -1
Views: 64
Reputation: 726
It's __init__, not _init_ (note the double underscores on each side).
Secondly, there is no h1 on the page. Try this instead:
yield {
    'title': resp.xpath("//title/text()").get()
}
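The underscores matter because Python only calls the double-underscore method __init__ automatically when an object is constructed; a method spelled _init_ is just an ordinary method that nothing ever invokes, which is why self.html was never set. A minimal sketch (plain Python, no Scrapy or Selenium, with hypothetical class names) showing the difference:

```python
class SingleUnderscore:
    def _init_(self):
        # Single underscores: just a regular method.
        # Python never calls this automatically.
        self.html = "<p>page source</p>"

class DoubleUnderscore:
    def __init__(self):
        # Double underscores: the real constructor,
        # called automatically on SingleUnderscore()/DoubleUnderscore().
        self.html = "<p>page source</p>"

s = SingleUnderscore()
d = DoubleUnderscore()
print(hasattr(s, "html"))  # False: _init_ was never run, so no .html attribute
print(hasattr(d, "html"))  # True: __init__ ran during construction
```

This matches the AttributeError in the traceback: with _init_, the spider object is created without ever running the Selenium code, so parse finds no self.html.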
Upvotes: 0