Maciej K

Reputation: 21

Selenium-chrome driver.get() in a loop breaks after a couple of repetitions

I want to scrape data from a web page that is constantly changing (new posts every couple of seconds). I'm calling driver.get() in a while loop, but after a couple of repetitions I stop getting new results: it returns the same posts over and over. I'm sure the page is changing (I checked in the browser).

I tried using time.sleep() and driver.refresh(), but the problem persists:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chrome_options,
                              executable_path=self.cp.getSeleniumDriverPath())

    while True:
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        posts = soup.find_all("div", class_="post")  # placeholder for the real selector

        # (...)
        # some logic with the result
        # (...)

        driver.refresh()  # tried interchangeably with driver.get() at the top of the loop

As far as I know, driver.get() should wait for the page to load before executing the next line of code. Maybe I did something wrong language-wise (I'm pretty new to Python). Should I reset some attribute of the driver on every loop run? I've seen solutions that call driver.get() in a loop like this, but it isn't working in my case. How do I force the driver to fully refresh the page before scraping it?

Upvotes: 2

Views: 3460

Answers (2)

Frederik Bode

Reputation: 2744

I'm guessing your Chrome webdriver is caching. Try clearing the cookies with driver.delete_all_cookies() before getting the page.
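A minimal sketch of how that might fit into the question's loop (the url and the "post" selector are placeholders carried over from the question, not the real page structure):

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chrome_options)

    while True:
        driver.delete_all_cookies()  # drop session cookies so stale state can't pin old content
        driver.get(url)              # url as defined in the question
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        posts = soup.find_all("div", class_="post")  # placeholder selector
        # ... some logic with the result ...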

Upvotes: 0

Reedinationer

Reputation: 5774

Selenium will throw errors if the page is still loading when you try to send commands to the window. You should add a time.sleep() or, better, one of Selenium's explicit waits to make sure the page is ready to be processed. Something like

    import time

    while True:
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        posts = soup.find_all("div", class_="post")  # placeholder for the real selector

        # (...)
        # some logic with the result
        # (...)

        driver.refresh()
        time.sleep(5)  # probably too long, but I usually try to stay on the safe side

The best option would probably be to use something like

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )

This pattern, from the Selenium documentation on waits, makes sure the element is present without forcing a fixed 5-second sleep. If the element you want shows up in 0.0001 seconds, your script continues right away. This lets you make the timeout arbitrarily large (say, 120 seconds) without impacting your execution speed.
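Put together with the question's loop, that could look like the sketch below (driver and url as set up in the question; the "post" class name is a hypothetical stand-in for whatever actually identifies a post on the page):

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    while True:
        driver.get(url)
        # block until at least one post is present, up to a generous 120-second timeout
        WebDriverWait(driver, 120).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post"))  # hypothetical locator
        )
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        posts = soup.find_all("div", class_="post")  # hypothetical selector
        # ... some logic with the result ...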

Upvotes: 1
