Reputation: 15639
First, I had never used Selenium until yesterday. I was able to scrape the target table correctly after many attempts.
I am currently trying to scrape the tables on sequential pages. It works sometimes, and other times it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved my problem. I am sure the answer is something simple, but after 8 hours I need to ask the Selenium experts.
My target URL is: RedHat Security Advisories
If there is a question on Stack Overflow that answers my problem, please let me know and I will do my own research and testing.
Here are some of the items that I have tried:
Example 1:
import time

from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

page_number = 0
while True:
    try:
        page_number += 1
        xpath = (f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                 f'/dir-pagination-controls/ul/li[{page_number}]')
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.XPATH, xpath))))
        browser.find_element_by_xpath(xpath).click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
Example 2:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable(
                (By.XPATH, '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                           '/dir-pagination-controls/ul/li[12]'))))
        browser.find_element_by_xpath(
            '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
            '/dir-pagination-controls/ul/li[12]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
Upvotes: 0
Views: 768
Reputation: 14145
You can use the below logic.
import time
from random import randint

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# the last pagination button holds the total number of pages
lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable(
    (By.XPATH, "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
               "/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
pages = '5'  # overrides the real count; remove this line to walk every page
for pNumber in range(1, int(pages)):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable(
        (By.XPATH, "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
                   "//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view
    currentPage.click()
    WebDriverWait(driver, 120).until_not(EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    # print row data here
    rows = driver.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")  # getting rows here
    for row in rows:
        print(row.text)  # printing all row data; if you want cell data, update the logic accordingly
    time.sleep(randint(1, 5))  # this step is optional
Upvotes: 1
Reputation: 4595
I believe you can read the data directly via the URL instead of clicking through the pagination; this will lead to fewer sync issues, which is probably why the script is failing.
1. Use this XPath to get the total number of pages for the security-updates table: //*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]
2. Run a loop up to the page count from step 1. Inside the loop, pass the page number into the URL below and send a GET request: https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=page_number&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct
3. Wait for the page to load.
4. Read the data from the table populated on the page.
5. Repeat until the pagination count is reached.
In case you hit an error indicating the site has blocked you, you can refresh the page with the same page_number. A rough sketch of the whole approach follows.
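Here is a minimal sketch of that approach, assuming the same Selenium 3 find_element_* API as the other snippets; the "#loading" spinner and cve-table selectors are borrowed from the other answer, and total_pages is hard-coded as a stand-in for the count read in step 1.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

BASE_URL = ("https://access.redhat.com/security/security-updates/#/security-advisories"
            "?q=&p={page}&sort=portal_publication_date%20desc&rows=10"
            "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct")

driver = webdriver.Chrome()
total_pages = 5  # stand-in; read the real count from the pagination control (step 1)

for page_number in range(1, total_pages + 1):
    driver.get(BASE_URL.format(page=page_number))  # request each page directly
    # wait until the loading overlay is gone before reading the table
    WebDriverWait(driver, 120).until_not(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    rows = driver.find_elements_by_xpath(
        "//table[starts-with(@class,'cve-table')]/tbody/tr")
    for row in rows:
        print(row.text)
    time.sleep(2)  # small pause so the site is less likely to drop the connection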
Upvotes: 0