Reputation: 5954
I've read about the StaleElementReferenceException in the official documentation, but I still don't understand why my code raises this exception. Does browser.get() instantiate a new spider?
class IndiegogoSpider(CrawlSpider):
    name = 'indiegogo'
    allowed_domains = ['indiegogo.com']
    start_urls = ['https://www.indiegogo.com/explore/all?project_type=all&project_timing=all&sort=trending']

    def parse(self, response):
        if (response.status != 404):
            options = Options()
            options.add_argument('-headless')
            browser = webdriver.Firefox(firefox_options=options)
            browser.get(self.start_urls[0])

            show_more = WebDriverWait(browser, 10).until(
                EC.element_to_be_clickable((By.XPATH, '//div[@class="text-center"]/a'))
            )
            while True:
                try:
                    show_more.click()
                except Exception:
                    break

            hrefs = WebDriverWait(browser, 60).until(
                EC.visibility_of_all_elements_located((By.XPATH, '//div[@class="discoverableCard"]/a'))
            )

            for href in hrefs:
                browser.get(href.get_attribute('href'))
                #
                # will be scraping individual pages here
                #

            browser.close()
I've tried the following, to no avail. I've also tried placing the links variable elsewhere in the script, in a different scope, also to no avail.
links = []
for href in hrefs:
    links.append(href.get_attribute('href'))

for link in links:
    browser.get(href.get_attribute('href'))
    #
    # will be scraping individual pages here
    #
Not sure why hrefs, and especially links, are erased from memory. When I extract the value of the href attribute of each item in the hrefs iterable and then stick all of the URLs in the links variable, shouldn't the links list be independent of the DOM and page changes?
Not sure what to do at this point. Any ideas?
Upvotes: 1
Views: 279
Reputation: 5647
As the documentation says:
A stale element reference exception is thrown in one of two cases, the first being more common than the second:
In your case it is because of browser.get(href.get_attribute('href')). When you navigate to another page, the DOM is completely reloaded, and the elements in hrefs no longer reference anything on the current page. That's why you are getting the error.
How do you deal with this error? You can do it like this:
links = []
for href in hrefs:  # store all links as plain strings
    links.append(href.get_attribute('href'))

for link in links:  # then just use them
    browser.get(link)
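To see why copying the attribute into a plain string works, here is a minimal, Selenium-free sketch of the principle. FakeElement is a hypothetical stand-in for a WebElement, not part of Selenium; it only illustrates that the strings survive a "page load" while the element objects do not:

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (for illustration only)."""

    def __init__(self, href):
        self._href = href
        self._stale = False

    def get_attribute(self, name):
        # A real WebElement raises StaleElementReferenceException here;
        # we simulate that with a plain RuntimeError.
        if self._stale:
            raise RuntimeError("stale element reference")
        return self._href


# Elements found on the listing page.
hrefs = [FakeElement("https://example.com/%d" % i) for i in range(3)]

# Extract the attribute values while the elements are still "attached".
links = [el.get_attribute("href") for el in hrefs]

# Simulate a page navigation, which invalidates every element reference.
for el in hrefs:
    el._stale = True

# The plain strings are independent of the DOM and survive the navigation.
print(links)
```

Reading every element through get_attribute after the navigation would raise, which is exactly what happens in the original loop.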
Upvotes: 2
Reputation: 1001
@Anthony, your second code block with links should work; it just looks like you have a copy/paste bug:

for link in links:
    browser.get(href.get_attribute('href'))

should be

for link in links:
    browser.get(link)
...
Upvotes: 1