Life is complex

Reputation: 15639

Accessing the next page using selenium

First, I have never used selenium until yesterday. I was able to scrape the target table correctly after many attempts.

I am currently trying to scrape the tables on the sequential pages. It works sometimes, and other times it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved my problem. I am sure the answer is something simple, but after 8 hours I need to ask a question of the experts in Selenium.

My target URL is: RedHat Security Advisories (https://access.redhat.com/security/security-updates/#/security-advisories)

If there is a question on Stack Overflow that answers my problem, please let me know and I will do my own research and testing.

Here are some of the items that I have tried:

Example 1:

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

page_number = 0
while True:
    try:
        page_number += 1

        # Build the XPath for the current pagination item; the page
        # number must be interpolated, not embedded as a literal string.
        link_xpath = ('//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                      f'/dir-pagination-controls/ul/li[{page_number}]')

        # Wait until the link is clickable, then scroll it into view
        link = WebDriverWait(browser, 30).until(
            EC.element_to_be_clickable((By.XPATH, link_xpath)))
        browser.execute_script("return arguments[0].scrollIntoView(true);", link)

        browser.find_element_by_xpath(link_xpath).click()

        print(f"Navigating to page {page_number}")

        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)

    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break

    except Exception as e:
        print(e)
        break

Example 2:

page_number = 0
while True:
    try:
        page_number += 1

        # Same as Example 1, but always clicking the hard-coded
        # twelfth pagination item
        link_xpath = ('//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                      '/dir-pagination-controls/ul/li[12]')

        link = WebDriverWait(browser, 30).until(
            EC.element_to_be_clickable((By.XPATH, link_xpath)))
        browser.execute_script("return arguments[0].scrollIntoView(true);", link)

        browser.find_element_by_xpath(link_xpath).click()

        print(f"Navigating to page {page_number}")

        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)

    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break

    except Exception as e:
        print(e)
        break

Upvotes: 0

Views: 768

Answers (2)

supputuri

Reputation: 14145

You can use the below logic.

import time
from random import randint

# Find the last pagination item to learn the total page count
lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable(
    (By.XPATH, "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
               "/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
# pages = '5'  # uncomment to limit the run while testing
for pNumber in range(1, int(pages) + 1):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable(
        (By.XPATH, "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
                   "//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view  # scrolls the link into view
    currentPage.click()
    # Wait for the loading overlay to disappear before reading the table
    WebDriverWait(driver, 120).until_not(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    rows = driver.find_elements_by_xpath(
        "//table[starts-with(@class,'cve-table')]/tbody/tr")  # getting rows here
    for row in rows:
        print(row.text)  # all row data; for cell values, update the logic accordingly
    time.sleep(randint(1, 5))  # this step is optional

Upvotes: 1

Amit Jain

Reputation: 4595

I believe you can read the data directly using the URL instead of clicking through the pagination; this will lead to fewer sync issues, which may be why your script is failing. A rough sketch follows the steps below.

  1. Use this XPath to get the total number of pages for the security-updates table: //*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]

  2. Loop up to the page count from step 1. Inside the loop, pass the page number into the URL below and send a GET request: https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=page_number&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct

  3. Wait for the page to load.

  4. Read the data from the table populated on the page.

  5. Repeat until the page count from step 1 is reached.

  6. In case you hit a specific error indicating the site has blocked you, refresh the page with the same page_number.
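
A minimal sketch of this approach, assuming the XPaths above still match the page, that the p= query parameter controls the page as described, and that driver is an already-created WebDriver currently on the advisories page:

import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# URL template from step 2; {page} replaces the page_number placeholder
BASE_URL = ("https://access.redhat.com/security/security-updates/"
            "#/security-advisories?q=&p={page}"
            "&sort=portal_publication_date%20desc&rows=10"
            "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct")

# Step 1: read the total page count from the last pagination item
# (assumes that element's text is the plain page number)
last_page = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
               '/dir-pagination-controls/ul/li[11]')))
page_count = int(last_page.text)

# Steps 2-5: request each page directly and read the table once it renders
for page in range(1, page_count + 1):
    driver.get(BASE_URL.format(page=page))
    rows = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located(
        (By.XPATH, "//table[starts-with(@class,'cve-table')]/tbody/tr")))
    for row in rows:
        print(row.text)
    time.sleep(2)  # throttle requests; step 6: re-request the same page if blocked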

Upvotes: 0
