SIM

Can't get rid of hardcoded delay even when Explicit Wait is already there

I've written some Python code using Selenium to parse the questions on a quora.com topic page. The scraper works at the moment, but only because I've used a hardcoded delay, even though an Explicit Wait is already defined. As the page scrolls infinitely, I limited the scrolling to a fixed number of iterations. I have two questions:

  1. Why is wait.until(EC.staleness_of(page)) not working in my scraper? It is commented out for now.
  2. If I use anything other than page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link"))), the scraper throws an error: can't focus element.

By the way, I'd rather not go with the page = driver.find_element_by_tag_name('body') option.

Here is what I've written so far:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)

page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
for scroll in range(10):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    # wait.until(EC.staleness_of(page))

for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
    print(item.text)

driver.quit()


Answers (1)

Andersson

You can try the code below to trigger as many XHRs as possible and then parse the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)

# Any visible question link serves as the target for the key presses.
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
links_counter = len(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "question_link"))))
while True:
    page.send_keys(Keys.END)
    try:
        # Wait up to 10 seconds for the number of links to grow.
        wait.until(lambda driver: len(driver.find_elements(By.CLASS_NAME, "question_link")) > links_counter)
        links_counter = len(driver.find_elements(By.CLASS_NAME, "question_link"))
    except TimeoutException:
        # No new links arrived in time, so stop scrolling.
        break


for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
    print(item.text)

driver.quit()

Here we scroll the page down and wait up to 10 seconds for more links to load. If the number of links stays the same, the wait raises a TimeoutException and we break out of the while loop.
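
The lambda in the loop can also be packaged as a reusable wait condition. This is only a minimal sketch: more_elements_than is a hypothetical name, not a Selenium built-in, and it relies solely on the fact that WebDriverWait calls the condition with the driver and keeps polling while the result is falsy.

```python
class more_elements_than:
    """Wait condition: truthy once more than `count` elements match `locator`.

    Hypothetical helper, not part of Selenium. `locator` is a
    (strategy, value) tuple such as (By.CLASS_NAME, "question_link").
    """

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # A falsy return value tells WebDriverWait to keep polling;
        # returning the list hands the matched elements back to the caller.
        return elements if len(elements) > self.count else False
```

Inside the loop this would read links = wait.until(more_elements_than((By.CLASS_NAME, "question_link"), links_counter)) followed by links_counter = len(links), which also avoids querying the page a second time.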

As for your questions:

  1. wait.until(EC.staleness_of(page)) does not work because scrolling down does not give you a new DOM: it only triggers an XHR that adds more links to the existing DOM. The first link (page) therefore stays attached and never becomes stale, so the wait times out.

  2. (I'm not fully sure about this, but...) I believe you can send keys only to nodes that can receive focus (ones the user could focus manually), e.g. links, input fields, textareas and buttons, but not content divisions (div), paragraphs (p), etc.
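
On the second point, one possible way to sidestep the focus problem without touching body is to scroll with JavaScript instead of key presses. This is a different technique from the send_keys approach above, and only a sketch: it assumes the document grows taller when new questions are appended (not verified against Quora). scroll_to_bottom and page_height_above are hypothetical helper names; execute_script is the standard WebDriver method.

```python
def scroll_to_bottom(driver):
    """Scroll the window to the bottom; return the page height measured before scrolling."""
    height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return height


def page_height_above(previous_height):
    """Wait condition: truthy once the document is taller than `previous_height`."""
    def _condition(driver):
        current = driver.execute_script("return document.body.scrollHeight")
        return current > previous_height
    return _condition
```

In the loop, previous = scroll_to_bottom(driver) would take the place of page.send_keys(Keys.END), and wait.until(page_height_above(previous)) would take the place of the link-count wait, with the same TimeoutException exit.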
