Dawn.Sahil

Reputation: 105

Problem with scraping multiple pages with selenium webdriver - python

I am trying to scrape a webpage and the links within it. The page is https://webgate.ec.europa.eu/rasff-window/screen/list . It lists 6000+ notifications, each with its own link. I want to store all of these links in a list. This is the code I am using:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import time

from webdriver_manager.chrome import ChromeDriverManager


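d = webdriver.Chrome(ChromeDriverManager().install())
# load the listing page before looking for links
d.get("https://webgate.ec.europa.eu/rasff-window/screen/list")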

#trying this scraping for multiple pages
links = []
i = 1
elems = d.find_elements(By.XPATH, "//a[@href]")  # find_elements_by_xpath was removed in Selenium 4.3
for elem in elems:
    link_list = elem.get_attribute("href")
    links.append(link_list)

while True:

  print("This is now page {}".format(i))
  i +=1
  time.sleep(1)
  try:
    time.sleep(0.5)
    WebDriverWait(d, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Next page']"))).click()
    print("we have clicked it once")
    time.sleep(0.9)
    
    elems2 = d.find_elements(By.XPATH, "//a[@href]")
    for elem2 in elems2:
        link_list = elem2.get_attribute("href")
        links.append(link_list)
    print("The button is clickable")
    time.sleep(1)
  except:
    print("The button is now not clickable, we have collected all the links")
    break

The idea is to use Selenium to collect all the href links on the current page, click the next-page button, and repeat, which is what my while loop does. But when I run this code it does not complete the entire loop. For example, with about 6400 notifications I expect it to run until the 64th page, but it stops partway through and reports that the next button is not clickable (the except branch), even though the button is in fact clickable. This happens on random pages, and changing the time.sleep values has not helped. Is there something wrong with what I am doing?

Upvotes: 0

Views: 290

Answers (1)

furas

Reputation: 142631

I printed the message from the exception:

except Exception as ex: 
     print(ex)

and it shows that the problem is not the button but the href.

It seems that the code sometimes gets references to <a> elements before JavaScript has finished updating the page. When it then tries to read href from one of those <a> elements, the error says the element no longer exists on the page, because in the meantime JavaScript removed it and inserted a new <a>.

Checking whether the button is clickable may also be useless, because the button exists the whole time.

You should rather sleep longer before getting the <a> elements, or find a better way to detect whether you got new references or the same ones as before.
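
One way to detect new references instead of sleeping longer might look like the sketch below. It polls the page until the list of href values differs from the previous page's list and two consecutive reads agree. The helper names read_hrefs and wait_for_new_hrefs and the retry_on parameter are hypothetical, not part of Selenium. The string "xpath" is the value of By.XPATH, so the sketch needs no Selenium import. It assumes that two consecutive pages never show an identical list of links, and I have not run it against this site.

```python
import time


def read_hrefs(driver, retry_on=Exception):
    """Return the href of every <a> on the page, or None if reading failed.

    retry_on is a hypothetical parameter: with real Selenium you would
    likely pass selenium.common.exceptions.StaleElementReferenceException
    so that only stale-element errors are treated as "try again".
    """
    try:
        # "xpath" is the string value of By.XPATH
        anchors = driver.find_elements("xpath", "//a[@href]")
        return [a.get_attribute("href") for a in anchors]
    except retry_on:
        # JavaScript replaced an <a> between lookup and read
        return None


def wait_for_new_hrefs(driver, previous, timeout=10.0, poll=0.25):
    """Poll until the href list differs from `previous` and is stable.

    "Stable" here means two consecutive reads returned the same list.
    Returns the new list, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        hrefs = read_hrefs(driver)
        if hrefs and hrefs != previous and hrefs == last:
            return hrefs
        last = hrefs
        time.sleep(poll)
    return None
```

In the loop from the question, this would replace the fixed sleeps: after each click on the next-page button, call wait_for_new_hrefs(d, previous_page), extend links with the result, keep the result as the new previous_page, and stop when it returns None.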

Upvotes: 1
