Reputation: 111
I am using Python to scrape data from a website, using Selenium in combination with Beautiful Soup. The page has buttons you can click that change the data displayed in its tables, but this is all handled by the page's JavaScript, so the URL does not change. Selenium successfully renders the JavaScript on page load, but it keeps returning the previous state (before the clicks), so I scrape the same data instead of the new data.
I tried following the solutions given on Obey The Testing Goat, but the wait always times out and the old page never goes stale. I've tried waiting 10 seconds manually with time.sleep to give the state a chance to refresh. I've tried using WebDriverWait
to wait until the old page turns stale, and I've looked through the Selenium documentation for possible solutions. The code below attempts the solution presented on that site, but it simply times out no matter what timeout I set.
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MySeleniumTest():
    # assumes self.browser is a selenium webdriver
    def __init__(self, browser, soup):
        self.browser = browser
        self.soup = soup

    @contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.browser.find_element_by_tag_name('html')
        yield
        WebDriverWait(self.browser, timeout).until(staleness_of(old_page))

    def tryChangingState(self):
        with self.wait_for_page_load(timeout=20):
            og_state = self.soup
            tab = self.browser.find_element_by_link_text('Breakfast')
            tab.click()
            tab = self.browser.find_element_by_link_text('Lunch')
            tab.click()
            new_state = self.soup
            # check if the HTML code has changed
            print(og_state != new_state)

# create tester object
tester = MySeleniumTest(browser, soup)
# try changing state after clicking on the buttons
tester.tryChangingState()
I'm not sure whether I'm using it correctly. I also tried opening a new with self.wait_for_page_load(timeout=20): block after the first click and putting the rest of the code inside it, but that did not work either. I would expect og_state != new_state
to evaluate to True, implying the HTML changed, but the actual result is False.
Upvotes: 1
Views: 2051
Reputation: 111
Original poster here. I found the cause of the issue. The state was being updated in Selenium, but since I was using Beautiful Soup for parsing, the soup object was still built from the page source of the previous state. By rebuilding the soup object after each click, the scraper was able to gather the new data successfully.
I updated the soup object by simply calling soup = BeautifulSoup(browser.page_source, 'lxml')
In other words, I didn't need to worry about the state of the Selenium web driver; it was simply a matter of refreshing the source code the parser was reading.
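To illustrate the fix without needing a live browser, here is a minimal sketch. The two HTML strings are hypothetical stand-ins for browser.page_source before and after a tab click (in the real scraper they come from the Selenium driver); the point is that a fresh BeautifulSoup must be built from the current source after every click, rather than reusing the soup parsed at load time:

```python
from bs4 import BeautifulSoup

def parse_current_state(page_source):
    """Re-parse the driver's current HTML so the soup matches the live DOM."""
    return BeautifulSoup(page_source, 'html.parser')

# Hypothetical stand-ins for browser.page_source before and after a click;
# in the real scraper these come from the Selenium web driver.
before_click = '<table><tr><td>Breakfast menu</td></tr></table>'
after_click = '<table><tr><td>Lunch menu</td></tr></table>'

og_state = parse_current_state(before_click)
# ... tab.click() would happen here in the real scraper ...
new_state = parse_current_state(after_click)

# The soups now differ, because new_state was rebuilt from fresh source
print(og_state != new_state)  # True
```

The same pattern applies with the 'lxml' parser used in the answer; 'html.parser' is used here only because it ships with the standard library.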
Upvotes: 1