Reputation: 719
I'm new to web scraping, and hence to bs4 and Selenium too. I was able to retrieve data from the table on the Olympics medalists page, but I have no idea how to obtain the data from the rest of the pages, since the site doesn't update its URL according to the page (based on the initial tutorials I went through).
I would like to know how I could cycle through the pages in this case.
Edit: Thanks for all the answers. Every answer added a concept for me to get data from a website in different ways. Thanks.
Upvotes: 1
Views: 621
Reputation: 4779
That page fetches the medals data from a JSON file. You can make a request to that JSON file and fetch the complete medals data.
Here is the URL of that JSON file.
https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/zzjm094b.json
Here is the code that prints sample medals data such as player name, country and medal. You can analyse the JSON file and extract whatever data you need.
import requests

# The page loads its medals table from this JSON endpoint
url = 'https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/zzjm094b.json'

r = requests.get(url)
j = r.json()

# Print the first 10 medallists: athlete name, country code and medal
for i in j['medallistsJSON'][:10]:
    print(f"{i['a_name']:25} {i['c_code']:10} {i['m_link']}")
Sample Medals Data
KIM Je Deok KOR Gold Medal
AN San KOR Gold Medal
SCHLOESSER Gabriela NED Silver Medal
WIJLER Steve NED Silver Medal
ALVAREZ Luis MEX Bronze Medal
VALENCIA Alejandra MEX Bronze Medal
CARAPAZ Richard ECU Gold Medal
van AERT Wout BEL Silver Medal
POGACAR Tadej SLO Bronze Medal
SZILAGYI Aron HUN Gold Medal
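If you want more than a printout, here is a minimal sketch that writes every medallist to a CSV file, using only the three fields already shown above (the JSON contains many more fields you can inspect and add):
import csv
import requests

url = 'https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/zzjm094b.json'
medallists = requests.get(url).json()['medallistsJSON']

# Write athlete name, country code and medal for every medallist
with open('medallists.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'country', 'medal'])
    for m in medallists:
        writer.writerow([m['a_name'], m['c_code'], m['m_link']])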
Upvotes: 1
Reputation: 29362
Using Selenium, the loop below is driven by number_of_pages, which is set to 5 here; if you want to grab data for more pages, change this value.
Code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/medalists.htm")
wait = WebDriverWait(driver, 20)

# Dismiss the cookie banner so it does not block the pagination links
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='onetrust-accept-btn-handler']"))).click()

number_of_pages = 5
page_to_start_clicking = 2

for i in range(1, number_of_pages):
    time.sleep(2)
    # Scroll each medallist name into view and print it
    for ele in driver.find_elements(By.CSS_SELECTOR, "div.playerTag span:nth-of-type(2)"):
        driver.execute_script("arguments[0].scrollIntoView(true);", ele)
        print(ele.text)
    # Click the next page number in the paginator
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT, f"{page_to_start_clicking}"))).click()
    page_to_start_clicking = page_to_start_clicking + 1
Imports:
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
CAVARS Agnis
KRUMINS Edgars
LASMANIS Karlis
MIEZIS Nauris
DOLSON Stefanie
GRAY Allisha
PLUM Kelsey
YOUNG Jacquelyn
GAZOZ Mete
KIM Je Deok
KIM Woojin
OH Jinhyek
AN San
AN San
JANG Minhee
KANG Chaeyoung
AN San
KIM Je Deok
HASHIMOTO Daiki
DOLGOPYAT Artem
HASHIMOTO Daiki
ZOU Jingyuan
WHITLOCK Max
LIU Yang
SHIN Jeahwan
ABLIAZIN Denis
BELYAVSKIY David
DALALOYAN Artur
NAGORNYY Nikita
LEE Sunisa
GUAN Chenchen
CAREY Jade
DERWAEL Nina
ANDRADE Rebeca
AKHAIMOVA Liliia
LISTUNOVA Viktoriia
MELNIKOVA Angelina
URAZOVA Vladislava
KOLESNICHENKO Svetlana
ROMASHINA Svetlana
CHIGIREVA Vlada
GOLIADKINA Marina
KOLESNICHENKO Svetlana
KOMAR Polina
PATSKEVICH Aleksandra
ROMASHINA Svetlana
SHISHKINA Alla
SHUROCHKINA Maria
BAREGA Selemon
JACOBS Lamont Marcell
PARCHMENT Hansle
INGEBRIGTSEN Jakob
de GRASSE Andre
STANO Massimo
EL BAKKALI Soufiane
GARDINER Steven
WARHOLM Karsten
DESALU Eseosa Fostine
JACOBS Lamont Marcell
PATTA Lorenzo
TORTU Filippo
BENJAMIN Rai
CHERRY Michael
DEADMON Bryce
NORMAN Michael
NORWOOD Vernon
ROSS Randolph
STEWART Trevor
CHEPTEGEI Joshua
TOMALA Dawid
KORIR Emmanuel Kipkurui
WARNER Damian
STAHL Daniel
NOWICKI Wojciech
TAMBERI Gianmarco
BARSHIM Mutaz Essa
CHOPRA Neeraj
TENTOGLOU Miltiadis
KIPCHOGE Eliud
DUPLANTIS Armand
Process finished with exit code 0
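If you prefer not to hard-code number_of_pages, you can try reading the page count from the paginator itself. A small sketch, assuming the driver from the code above is already on the medalists page and that the numbered pagination links sit inside li.paginate_button elements (an assumption about the markup; verify it in the browser dev tools):
# Collect the numbered pagination links and take the largest number as the page count.
# "li.paginate_button a" is an assumed selector for the DataTables-style paginator.
page_links = driver.find_elements(By.CSS_SELECTOR, "li.paginate_button a")
page_numbers = [int(a.text) for a in page_links if a.text.strip().isdigit()]
number_of_pages = max(page_numbers) if page_numbers else 1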
Upvotes: 2
Reputation: 33351
In order to go to the next page you have to click the next pagination button on the bottom right part of the page. To do so you will have to scroll the page down and click that button. For scrolling you can use the ActionChains class.
So for each page you gather your data and then do this:
# Locate the "next" pagination link, scroll it into view, then click it
next_page_btn = driver.find_element_by_xpath('//li[@class="paginate_button page-item next"]//a')
actions.move_to_element(next_page_btn).perform()
time.sleep(0.5)
next_page_btn.click()
Before that you will have to import
from selenium.webdriver.common.action_chains import ActionChains
and initialize the actions object with
actions = ActionChains(driver)
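Put together, a minimal loop built on this idea could look like the sketch below. It keeps clicking the next button until the surrounding li element gets a "disabled" class, which is how DataTables-style paginators usually mark the last page; that class name is an assumption, so check it against the actual markup:
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/medalists.htm")
# (dismiss the cookie banner first, as shown in the other answer, or it may block the click)

while True:
    # ... gather the data for the current page here ...

    # Assumption: the li around the "next" link gains a "disabled" class on the
    # last page, which is how DataTables-style paginators usually mark it.
    next_li = driver.find_element_by_xpath('//li[contains(@class, "page-item") and contains(@class, "next")]')
    if "disabled" in next_li.get_attribute("class"):
        break

    # Scroll the button into view, then click it
    next_page_btn = next_li.find_element_by_xpath('.//a')
    ActionChains(driver).move_to_element(next_page_btn).perform()
    time.sleep(0.5)
    next_page_btn.click()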
Upvotes: 1