Reputation: 117
I wrote some Selenium code to extract the number of rounds in a soccer league. As far as I can see, the elements are the same on every page, but for some reason the code works for some links and not for others.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from time import sleep

def pack_links(l):
    options = Options()
    options.headless = True
    driver = webdriver.Chrome()
    driver.get(l)
    rnds = driver.find_element_by_id('showRound')
    a_ = rnds.find_elements_by_xpath(".//td[@class='lsm2']")
    #a_ = driver.find_elements_by_class_name('lsm2')
    knt = 0
    for _ in a_:
        knt = knt+1
    print(knt)
    sleep(2)
    driver.close()
    return None

link = 'http://info.nowgoal.com/en/League/34.html'
pack_links(link)
Here is a link that works, Nowgoal Serie B: it returns the number of td tags with class lsm2.
[Image: what the page source looks like]
This one returns 0; for some reason it does not find the tags with class lsm2: Nowgoal Serie A.
[Image: the segment of interest in the page source]
Even when I try to find them directly with the commented-out line
    a_ = driver.find_elements_by_class_name('lsm2')
it still returns 0. I would appreciate any help with this.
Upvotes: 1
Views: 942
Reputation: 115
As far as I understand, the inner HTML of the td with id "showRound" is dynamic: it is filled in by the showRound() JS function, which in turn is invoked by a script in the page's head tag on page load. Consequently, in your case the content simply does not get enough time to load before you query it. I've tried to solve this issue in two ways:
1. A kludge: call driver.implicitly_wait(number_of_seconds_to_wait) before looking up the elements. It makes every find_element/find_elements call keep polling the page for up to that many seconds before giving up, so, unlike sleep(), it returns as soon as the elements appear. I would recommend it over sleep() in the future. However, it is quite clumsy: the timeout applies to every lookup the driver makes, and it cannot wait for anything more specific than an element being present.
2. Wait explicitly for the first element with class "lsm2" to load; if it fails to appear within some reasonable timeout, stop waiting and raise an exception (thanks to Zeinab Abbasimazar for the answer on this). This can be done with expected_conditions and WebDriverWait:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

def pack_links(l):
    options = webdriver.ChromeOptions()  # I would also suggest using this instead of Options()
    options.add_argument("--headless")
    options.add_argument("--enable-javascript")  # To be on the safe side, although it seems to be enabled by default
    driver = webdriver.Chrome("path_to_chromedriver_binary", options=options)
    driver.get(l)
    rnds = driver.find_element_by_id('showRound')
    # Up to here your code is almost unchanged. Now wait for the first td
    # element with class lsm2 to load, with a maximum timeout of 5 seconds:
    try:
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "lsm2")))
        print("All necessary tables have been loaded successfully")
    except TimeoutException:
        # raise("Timeout error") would fail with a TypeError: only exception objects can be raised
        raise TimeoutException("Timeout error")
    # Then we proceed in case of success:
    a_ = rnds.find_elements_by_xpath(".//td[@class='lsm2']")
    knt = 0
    for _ in a_:
        knt = knt+1
    print(knt)
    driver.implicitly_wait(2)  # Not needed here anymore: the explicit wait above has already done the waiting
    driver.close()
    driver.quit()  # Make sure you quit the driver, not only close it, unless you want to kill numerous RAM-greedy Chrome processes by hand
    return None
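To make the difference between a fixed sleep() and the explicit wait above more concrete, here is a minimal pure-Python sketch of the polling idea. It is only an illustration and not how Selenium implements WebDriverWait internally; the names wait_until and fake_find_cells, and the fake data, are made up for this example.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the result is there
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page whose cells only show up on the third lookup
# (hypothetical data, not taken from the real site).
attempts = {"count": 0}


def fake_find_cells():
    attempts["count"] += 1
    return ["td", "td", "td"] if attempts["count"] >= 3 else []


cells = wait_until(fake_find_cells, timeout=2.0, poll_interval=0.01)
print(len(cells))
```

With a real driver, the condition would be something like lambda: driver.find_elements_by_class_name('lsm2'). In practice WebDriverWait is the better choice, since it already handles this loop for you.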
You can experiment with the timeout length to find a value that gives a reliable result. I would also suggest using len(a_) instead of counting with a for loop, but that's up to you.
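For comparison, here is a rough sketch of the first option (the implicit wait) combined with len(). It uses the generic find_elements(by, value) call, which current Selenium releases use in place of the find_elements_by_* helpers. The function names count_round_cells and run_headless are made up for this example, the CSS selector is my assumption of an equivalent to the XPath above, and chromedriver is assumed to be discoverable without an explicit path.

```python
def count_round_cells(driver, timeout=5):
    """Return how many td.lsm2 cells sit inside the element with id showRound.

    `driver` is any Selenium WebDriver. The implicit wait is set before the
    lookup because it only affects lookups made afterwards.
    """
    driver.implicitly_wait(timeout)
    # "css selector" is the string value behind By.CSS_SELECTOR.
    # Note: td.lsm2 also matches cells that carry extra classes,
    # unlike the exact @class='lsm2' test in the XPath version.
    cells = driver.find_elements("css selector", "#showRound td.lsm2")
    return len(cells)  # no counting loop needed


def run_headless(url):
    # Imported here so count_round_cells can be tried without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)  # assumes chromedriver can be found
    try:
        driver.get(url)
        return count_round_cells(driver)
    finally:
        driver.quit()  # always release the browser process
```

Calling run_headless('http://info.nowgoal.com/en/League/34.html') should then return the count, provided the cells appear within the timeout; if they never do, find_elements returns an empty list and the result is 0 rather than an exception.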
Upvotes: 1