Navigate pagination with Selenium Webdriver

I am trying to navigate a pagination bar on a series of webpages and then retrieve the links on each page. This is the HTML:

<ul class="pagination">
  <li class="active">
    <a href="/dataset?groups=heal&amp;_groups_limit=0&amp;page=1">1</a>
  </li>
  <li>
    <a href="/dataset?groups=heal&amp;_groups_limit=0&amp;page=2">2</a>
  </li>
  <li>
    <a href="/dataset?groups=heal&amp;_groups_limit=0&amp;page=3">3</a>
  </li>
  <li class="disabled">
    <a href="#">...</a>
  </li>
  <li>
    <a href="/dataset?groups=heal&amp;_groups_limit=0&amp;page=7">7</a>
  </li>
  <li>
    <a href="/dataset?groups=heal&amp;_groups_limit=0&amp;page=2">»</a>
  </li>
</ul>

I managed to find the number of pages, so I am trying to iterate over them. My idea was to select the active element and then click() the next one, but being unfamiliar with XPath I am stumbling on how to do that.

This is the code I am using:

driver.find_element_by_xpath("//li[class='active']/a//following").click()

Any help would be appreciated.

Upvotes: 1

Views: 2612

Answers (2)

Matteo Moreschini

Reputation: 99

Since the URLs follow the format BASE_URL&page=NUM_PAGE, you could simply get the maximum page number (7 in your case).

In that way you can build all the urls with something like:

BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal"
MAX_PAGES = 7  # highest page number shown in the pagination bar

urls = []
for page_num in range(1, MAX_PAGES + 1):  # range() excludes the stop value, so add 1
    urls.append(f"{BASE_URL}&page={page_num}")

This way you'll have all the pages without having to click anything, just by knowing the maximum number of pages, which you can easily find as you already did.
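If you'd rather not hard-code MAX_PAGES, the maximum page number can be parsed out of the pagination hrefs themselves. A minimal sketch using only Python's standard library, fed with the hrefs from the HTML in the question:

```python
from urllib.parse import urlparse, parse_qs

def max_page(hrefs):
    """Return the highest page= value among pagination hrefs."""
    pages = []
    for href in hrefs:
        qs = parse_qs(urlparse(href).query)  # e.g. {"groups": ["heal"], "page": ["7"]}
        if "page" in qs:
            pages.append(int(qs["page"][0]))
    return max(pages)

# hrefs as they appear in the pagination <ul> above
hrefs = [
    "/dataset?groups=heal&_groups_limit=0&page=1",
    "/dataset?groups=heal&_groups_limit=0&page=2",
    "/dataset?groups=heal&_groups_limit=0&page=3",
    "#",  # the disabled "..." entry has no page parameter and is skipped
    "/dataset?groups=heal&_groups_limit=0&page=7",
]
print(max_page(hrefs))  # 7
```

With Selenium, the list of hrefs could come from the anchors inside the pagination bar; the parsing stays the same.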

SELENIUM SOLUTION

There are probably cleaner ways to do this, but this one works for me: simply loop over the list of page numbers until you find the "active" one, as you said, and then visit the following link.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal&page=1"  # first page

driver = webdriver.Chrome(
    executable_path=ChromeDriverManager().install()
)
driver.get(BASE_URL)

# XPath of the pagination bar at the bottom of the page
url_list_xpath = "/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul"

# href of the last page, read from the pagination bar
last_page = driver.find_element_by_xpath(
    "/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul/li[5]/a"
).get_attribute("href")
current_page = BASE_URL
to_click = False

# iterate over the urls, each time visiting the one right after the active entry
while current_page != last_page:
    ul = driver.find_element_by_xpath(url_list_xpath)
    for li in ul.find_elements_by_tag_name("li"):
        if to_click:  # this li follows the active one: stop here
            break
        if li.get_attribute("class") == 'active':
            to_click = True
    to_click = False  # reset the flag for the next iteration
    current_page = li.find_element_by_tag_name("a").get_attribute("href")
    driver.get(current_page)
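The core of that loop, reduced to plain Python over (class, href) pairs, looks like this. A sketch independent of Selenium; `next_href` is a hypothetical helper name:

```python
def next_href(lis):
    """Given (css_class, href) pairs for the pagination <li>s,
    return the href of the entry right after the 'active' one."""
    for i, (cls, _) in enumerate(lis):
        if cls == "active" and i + 1 < len(lis):
            return lis[i + 1][1]
    return None  # no active entry, or the active entry is last

# pagination state while page 1 is active
lis = [
    ("active", "/dataset?groups=heal&page=1"),
    ("", "/dataset?groups=heal&page=2"),
    ("", "/dataset?groups=heal&page=3"),
]
print(next_href(lis))  # /dataset?groups=heal&page=2
```

Note that the entry after the active one can be the disabled "..." item, whose href is "#"; depending on the page layout you may need to skip it.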

Upvotes: 1

Ali Doggaz

Reputation: 105

EDITED (I adapted the code to your link)

from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://dati.comune.milano.it/callgroup/e6528afc-bd2c-417b-99a8-d7704f942a42')

hrefs = []  # will contain the final list of links
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page

# grab all <a> elements (they carry the links)
hrefs_in_view = driver.find_elements_by_tag_name('a')

# keep only the relevant hrefs
for elem in hrefs_in_view:
    href = elem.get_attribute('href')
    if href is None:  # some <a> elements have no href attribute
        continue
    if 'dataset' in href:  # all the wanted links contain 'dataset'; adapt this filter to your needs
        hrefs.append(href)

The list hrefs will contain all the links you need.
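The important detail is that get_attribute('href') returns None for anchors without an href, so the filter has to skip those before testing for 'dataset'. The same filter as plain Python, with made-up input (a sketch; `dataset_links` is a hypothetical helper name):

```python
def dataset_links(hrefs):
    """Keep only real hrefs that contain the word 'dataset'."""
    return [h for h in hrefs if h is not None and "dataset" in h]

links = dataset_links([
    "https://dati.comune.milano.it/dataset?groups=heal&page=2",
    None,  # e.g. an <a> without an href attribute
    "https://dati.comune.milano.it/about",
])
print(links)  # ['https://dati.comune.milano.it/dataset?groups=heal&page=2']
```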

Upvotes: 2
