Xin

Reputation: 674

How to select tab and scrape results for all pages with Selenium?

I have created the function below to scrape results from the website, I am wondering how to:

  1. First click on the "Tables (8,899)" tab and only scrape results from there.
  2. Right now it only scrapes the first page. How would I go about scraping all the pages and appending them into one dataframe, without having to specify the number of pages?

Function:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from functools import reduce
import pandas as pd

def stats_canada():
    driver = webdriver.Chrome('/Users/wwds/Desktop/chromedriver')
    driver.get('https://www150.statcan.gc.ca/n1/en/type/data?count=100&p=-All%2C5-data/tables#all')
    elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#all a[target='_self']")))
    linkTitles = pd.DataFrame([title.text for title in elements]).rename(columns={0: 'Name'})
    links = pd.DataFrame([link.get_attribute("href") for link in elements]).rename(columns={0: 'Link'})
    elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#all span[class='ndm-result-date']")))
    release_date = pd.DataFrame([date.text for date in elements]).rename(columns={0: 'Release Date'})
    elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#all div[class='ndm-result-productid']")))
    table_id = pd.DataFrame([table.text for table in elements]).rename(columns={0: 'Table ID'})
    table_id['Table ID'] = table_id['Table ID'].str.replace("Table: ", "")
    data = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), [linkTitles, links, release_date, table_id])
    return data


stats_canada()

Thanks in advance!

Upvotes: 1

Views: 335

Answers (1)

Tanmoy Datta

Reputation: 1644

Firstly, you have the id for the "Tables (8,899)" tab, and you have to click on it. For this you can use the following:

import time

elem = driver.find_element_by_id('tables-lnk')
elem.click()
time.sleep(10)  # this delay is for loading the page

Now scrape every entry from this page using Selenium or Beautiful Soup, whichever you are familiar with, and add them to your dataframe.

Then click the "Next" button at the bottom of the page; you can find the button's id and click it in the same way.
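Putting the steps together, here is a minimal sketch of the click-the-tab-then-paginate loop. The `tables-lnk` id comes from the snippet above; the `#tables` container id, the result selectors, and the `Next` link text are assumptions carried over from the question's selectors and may need adjusting against the live page.

```python
import pandas as pd

def build_frame(names, links, dates, table_ids):
    # Assemble one page of results into a DataFrame; this replaces the
    # four-way index merge in the original function.
    return pd.DataFrame({
        'Name': names,
        'Link': links,
        'Release Date': dates,
        'Table ID': [t.replace("Table: ", "") for t in table_ids],
    })

def scrape_all_tables():
    # Selenium is imported inside the function so the helpers above
    # stay importable without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    driver.get('https://www150.statcan.gc.ca/n1/en/type/data?count=100')
    wait = WebDriverWait(driver, 30)

    # 1. Click the "Tables (8,899)" tab via its id.
    wait.until(EC.element_to_be_clickable((By.ID, 'tables-lnk'))).click()

    frames = []
    while True:
        # 2. Scrape one page of results (selectors are assumptions
        #    adapted from the question's "#all ..." selectors).
        rows = wait.until(EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, "#tables a[target='_self']")))
        names = [r.text for r in rows]
        links = [r.get_attribute('href') for r in rows]
        dates = [d.text for d in driver.find_elements(
            By.CSS_SELECTOR, "#tables span.ndm-result-date")]
        ids = [t.text for t in driver.find_elements(
            By.CSS_SELECTOR, "#tables div.ndm-result-productid")]
        frames.append(build_frame(names, links, dates, ids))

        # 3. Advance to the next page; stop when no "Next" link remains.
        try:
            driver.find_element(By.LINK_TEXT, 'Next').click()
        except NoSuchElementException:
            break

    driver.quit()
    # pd.concat avoids having to know the page count in advance.
    return pd.concat(frames, ignore_index=True)
```

Because the loop stops only when the "Next" link disappears, you never need to hard-code the number of pages; `pd.concat` stitches the per-page frames into one dataframe at the end.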

Upvotes: 2
