Reputation: 21
I want to scrape data from a web page that is constantly changing (new posts every couple of seconds). I'm calling driver.get() in a while loop, but after a couple of iterations I stop getting new results: it returns the same post over and over. I'm sure the page is changing (I checked in the browser).
I tried using time.sleep() and driver.refresh(), but the problem persists.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options,
                          executable_path=self.cp.getSeleniumDriverPath())

while True:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    posts = soup.find_all(some class)
    (...)
    some logic with the result
    (...)
    driver.refresh()  # tried interchangeably with driver.get() at the beginning of the loop
As far as I know, driver.get() should wait for the page to load before executing the next line of code. Maybe I did something wrong language-wise (I'm pretty new to Python). Should I reset some attribute of the driver on every loop iteration? I've seen solutions that use driver.get() in a loop like this, but it isn't working in my case. How do I force the driver to fully refresh the page before scraping it?
Upvotes: 2
Views: 3460
Reputation: 2744
I'm guessing your Chrome webdriver is caching. Try adding this (the Python equivalent of deleteAllCookies()):
driver.delete_all_cookies()
before getting the page.
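A minimal sketch of how that could slot into the loop from your question (driver, url, and the post-parsing logic are assumed to be the ones you already have):

from bs4 import BeautifulSoup

while True:
    driver.delete_all_cookies()  # drop any cookies Chrome has accumulated between fetches
    driver.get(url)              # fetch a fresh copy of the page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... same post parsing and logic as before ...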
Upvotes: 0
Reputation: 5774
Selenium can return errors (or stale content) if the page is still loading when you send commands to the window. You should add a time.sleep() or, better, a Selenium-specific wait to make sure the page is ready to be processed. Something like:
import time

while True:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    posts = soup.find_all(some class)
    (...)
    some logic with the result
    (...)
    driver.refresh()
    time.sleep(5)  # probably too long, but I usually try to stay on the safe side
The best option would probably be to use an explicit wait, something like

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)
That pattern comes from the Selenium documentation on explicit waits. It makes sure the element is present without forcing a fixed 5-second pause: if the element you want shows up in 0.0001 seconds, your script continues after that long, so you can make the timeout arbitrarily large (say, 120 seconds) without impacting your execution speed.
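Folded into your polling loop, it could look roughly like this (a sketch, not your exact code: the By.CSS_SELECTOR value "div.post" and the matching find_all call are placeholders for whatever actually identifies a post on your page):

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

while True:
    driver.get(url)
    # wait until at least one post element is present (up to 120 s) instead of sleeping a fixed time
    WebDriverWait(driver, 120).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.post"))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    posts = soup.find_all("div", class_="post")  # placeholder selector; adjust to the real page
    # ... your logic with the result ...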
Upvotes: 1