Reputation: 139
I am trying to navigate a list of pagination links on a series of webpages like this one, and then retrieve the links on each page. This is the HTML:
<ul class="pagination">
<li class="active">
<a href="/dataset groups=heal&_groups_limit=0&page=1">1</a>
</li>
<li>
<a href="/dataset?groups=heal&_groups_limit=0&page=2">2</a>
</li>
<li>
<a href="/dataset?groups=heal&_groups_limit=0&page=3">3</a>
</li>
<li class="disabled">
<a href="#">...</a>
</li>
<li>
<a href="/dataset?groups=heal&_groups_limit=0&page=7">7</a>
</li>
<li>
<a href="/dataset?groups=heal&_groups_limit=0&page=2">»</a>
</li>
</ul>
I managed to find the number of pages, so I am trying to iterate over that. My idea was to select the active element and then click() on the next one. Being unfamiliar with XPath, I am stumbling on how to do that.
This is the code I am using:
driver.find_element_by_xpath("//li[class='active']/a//following").click()
Any help would be appreciated.
Upvotes: 1
Views: 2612
Reputation: 99
Since the URLs are in the format BASE_URL+page=NUM_PAGE, you can simply get the maximum page number (7 in your case).
That way you can build all the URLs with something like:
BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal"
urls = []
for page_num in range(1, MAX_PAGES):
urls.append(f"{BASE_URL}&page={page_num}")
This way you'll have all the pages without having to click anything, just by knowing the maximum number of pages, which you can easily find as you already did.
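For the second part of your question (retrieving the links on each page), here is a minimal sketch of fetching each built URL without Selenium at all. It assumes the requests and beautifulsoup4 packages are available, and the 'dataset' substring filter is only an example; adapt it to the links you need.
# Minimal sketch: fetch each built URL and collect the links on it.
# Assumes requests and beautifulsoup4 are installed; the 'dataset'
# filter below is an example, not part of the original answer.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

all_links = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if "dataset" in a["href"]:
            all_links.append(urljoin(url, a["href"]))  # make relative hrefs absolute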
SELENIUM SOLUTION
There are probably a thousand cleaner ways to do this, but this one works for me: simply loop over the list items until you find the one with class "active", as you said, and open the one that follows it.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal&page=1"  # first page

driver = webdriver.Chrome(
    executable_path=ChromeDriverManager().install()
)
driver.get(BASE_URL)

url_list_xpath = "/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul"  # this is the page bar at the bottom
to_click = False
last_page = driver.find_element_by_xpath("/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul/li[5]/a") \
    .get_attribute("href")  # find the last page
current_page = BASE_URL

# iterate over the list items and open the url right after the active one
while current_page != last_page:
    ul = driver.find_element_by_xpath(url_list_xpath)
    for li in ul.find_elements_by_tag_name("li"):
        if to_click:
            break  # `li` is now the item right after the active one
        if li.get_attribute("class") == 'active':
            to_click = True
    to_click = False
    current_page = li.find_elements_by_tag_name("a")[0].get_attribute("href")
    driver.get(current_page)
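Incidentally, the XPath in your question is close: attribute tests need an @, and following-sibling selects the next list item. Untested against the live page, but assuming the class attribute is exactly "active", a sketch of clicking the next page link directly:
# Sketch only: click the link in the <li> right after the active one
driver.find_element_by_xpath("//li[@class='active']/following-sibling::li[1]/a").click()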
Upvotes: 1
Reputation: 105
EDITED (I adapted the code to your link)
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://dati.comune.milano.it/callgroup/e6528afc-bd2c-417b-99a8-d7704f942a42')

hrefs = []  # will contain the final list of links
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scrolls to the bottom of the page

# grab all elements tagged 'a' (the ones that contain links)
hrefs_in_view = driver.find_elements_by_tag_name('a')

# finding relevant hrefs
for elem in hrefs_in_view:
    href = elem.get_attribute('href')
    if href is None:  # remove some irrelevant elements that have no href
        continue
    if 'dataset' in href:  # all the links should contain the word 'dataset'; change this to adapt it to your needs
        hrefs.append(href)  # add it to the list
The list hrefs will contain all the links you need.
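If the same link shows up more than once on the page, one way to drop duplicates while keeping the original order (just a suggestion, not part of the original code):
hrefs = list(dict.fromkeys(hrefs))  # dict keys preserve insertion order and are unique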
Upvotes: 2