Shiva Govindaswamy

Reputation: 719

Pagination for BeautifulSoup with Selenium in Python web scraping

I'm new to web scraping, and therefore to bs4 and Selenium as well. I was able to retrieve the data from the table on the Olympics medalists page, but I have no idea how to obtain the data from the rest of the pages, since the site doesn't update its URL when the page changes (unlike the sites in the introductory tutorials I went through).

I would like to know how I could cycle through the pages in this case.

Edit: Thanks for all the answers. Each one showed me a different way to get data from a website.

Upvotes: 1

Views: 621

Answers (3)

Ram

Reputation: 4779

That page fetches the medals data from a JSON file. You can make a request to that JSON file and fetch the complete medals data.

Here is the URL of that JSON file.

https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/zzjm094b.json

Here is code that prints a sample of the medals data (athlete name, country code and medal). You can inspect the JSON file and extract whatever other fields you need.

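import requests

url = 'https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/zzjm094b.json'
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error instead of failing in r.json()
data = r.json()

# Print name, country code and medal for the first 10 medallists
for m in data['medallistsJSON'][:10]:
    print(f"{m['a_name']:25} {m['c_code']:10} {m['m_link']}")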

Sample Medals Data

KIM Je Deok               KOR        Gold Medal
AN San                    KOR        Gold Medal
SCHLOESSER Gabriela       NED        Silver Medal
WIJLER Steve              NED        Silver Medal
ALVAREZ Luis              MEX        Bronze Medal
VALENCIA Alejandra        MEX        Bronze Medal
CARAPAZ Richard           ECU        Gold Medal
van AERT Wout             BEL        Silver Medal
POGACAR Tadej             SLO        Bronze Medal
SZILAGYI Aron             HUN        Gold Medal
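
If you want to keep the data rather than print it, the same records can be written out as CSV. This is only a minimal sketch, assuming the payload keeps the medallistsJSON list and the a_name, c_code and m_link keys used above. The function names medal_rows and to_csv are made up for illustration, and the sample dictionary is a hand-written stand-in for r.json().

```python
import csv
import io


def medal_rows(payload):
    """Return (name, country, medal) tuples from the parsed JSON payload."""
    return [
        (m["a_name"], m["c_code"], m["m_link"])
        for m in payload.get("medallistsJSON", [])
    ]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "country", "medal"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical stand-in for r.json(); only the keys used above are included.
sample = {
    "medallistsJSON": [
        {"a_name": "KIM Je Deok", "c_code": "KOR", "m_link": "Gold Medal"},
        {"a_name": "AN San", "c_code": "KOR", "m_link": "Gold Medal"},
    ]
}

csv_text = to_csv(medal_rows(sample))
print(csv_text)
```

With the real response you would pass r.json() to medal_rows and write the returned text to a file.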

Upvotes: 1

cruisepandey

Reputation: 29362

Using Selenium:

  1. Accept the cookie banner by clicking its button.
  2. Scroll to each element to read its data.
  3. Click the link for the next page to load the new content.
  4. The code below extracts data from the first 5 pages. The variable number_of_pages is set to 5; change this value if you want to scrape a different number of pages.

Code:

# driver_path is the path to your chromedriver executable (define it yourself).
# Recent Selenium 4 releases expect a Service object here instead of a path.
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://olympics.com/tokyo-2020/olympic-games/en/results/all-sports/medalists.htm")
wait = WebDriverWait(driver, 20)

# Dismiss the cookie banner so it does not block later clicks
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='onetrust-accept-btn-handler']"))).click()

number_of_pages = 5
for page in range(1, number_of_pages + 1):
    time.sleep(2)  # give the table time to render
    for ele in driver.find_elements(By.CSS_SELECTOR, "div.playerTag span:nth-of-type(2)"):
        driver.execute_script("arguments[0].scrollIntoView(true);", ele)
        print(ele.text)
    # Click the link for the following page, except after the last page we want
    if page < number_of_pages:
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page + 1)))).click()

Imports:

import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

CAVARS Agnis
KRUMINS Edgars
LASMANIS Karlis
MIEZIS Nauris
DOLSON Stefanie
GRAY Allisha
PLUM Kelsey
YOUNG Jacquelyn
GAZOZ Mete
KIM Je Deok
KIM Woojin
OH Jinhyek
AN San
AN San
JANG Minhee
KANG Chaeyoung
AN San
KIM Je Deok
HASHIMOTO Daiki
DOLGOPYAT Artem
HASHIMOTO Daiki
ZOU Jingyuan
WHITLOCK Max
LIU Yang
SHIN Jeahwan
ABLIAZIN Denis
BELYAVSKIY David
DALALOYAN Artur
NAGORNYY Nikita
LEE Sunisa
GUAN Chenchen
CAREY Jade
DERWAEL Nina
ANDRADE Rebeca
AKHAIMOVA Liliia
LISTUNOVA Viktoriia
MELNIKOVA Angelina
URAZOVA Vladislava
KOLESNICHENKO Svetlana
ROMASHINA Svetlana
CHIGIREVA Vlada
GOLIADKINA Marina
KOLESNICHENKO Svetlana
KOMAR Polina
PATSKEVICH Aleksandra
ROMASHINA Svetlana
SHISHKINA Alla
SHUROCHKINA Maria
BAREGA Selemon
JACOBS Lamont Marcell
PARCHMENT Hansle
INGEBRIGTSEN Jakob
de GRASSE Andre
STANO Massimo
EL BAKKALI Soufiane
GARDINER Steven
WARHOLM Karsten
DESALU Eseosa Fostine
JACOBS Lamont Marcell
PATTA Lorenzo
TORTU Filippo
BENJAMIN Rai
CHERRY Michael
DEADMON Bryce
NORMAN Michael
NORWOOD Vernon
ROSS Randolph
STEWART Trevor
CHEPTEGEI Joshua
TOMALA Dawid
KORIR Emmanuel Kipkurui
WARNER Damian
STAHL Daniel
NOWICKI Wojciech
TAMBERI Gianmarco
BARSHIM Mutaz Essa
CHOPRA Neeraj
TENTOGLOU Miltiadis
KIPCHOGE Eliud
DUPLANTIS Armand

Process finished with exit code 0
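
Names such as AN San appear more than once in this output, presumably because the table has one row per medal. If you want a count per athlete instead, you can collect the names into a list and tally them. A small sketch: the names list below is a hand-copied excerpt of the output above, and in the real loop you would append ele.text to a list instead of printing it.

```python
from collections import Counter

# Excerpt copied from the output above; in practice, build this list
# inside the scraping loop with names.append(ele.text).
names = [
    "KIM Je Deok", "KIM Woojin", "OH Jinhyek",
    "AN San", "AN San", "JANG Minhee", "KANG Chaeyoung",
    "AN San", "KIM Je Deok",
]

rows_per_athlete = Counter(names)

# Show the athletes that occur most often in the scraped rows
for name, count in rows_per_athlete.most_common(3):
    print(f"{name:25} {count}")
```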

Upvotes: 2

Prophet

Reputation: 33351

To go to the next page, you have to click the "next" pagination button at the bottom right of the page.
To do so, you have to scroll down to that button and then click it.
For scrolling, you can use the ActionChains class.
So for each page, gather your data and then do this:

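# find_element_by_xpath was removed in later Selenium 4 releases;
# find_element(By.XPATH, ...) works in both Selenium 3 and 4.
next_page_btn = driver.find_element(By.XPATH, '//li[@class="paginate_button page-item next"]//a')
actions.move_to_element(next_page_btn).perform()
time.sleep(0.5)
next_page_btn.click()

Before that you will have to import

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

and initialize the actions object with

actions = ActionChains(driver)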
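
To repeat this for every page without knowing the page count in advance, one option is to keep clicking "next" until the button is disabled. The sketch below rests on an assumption I have not verified for this site: that the "next" li element gains a disabled class on the last page, as DataTables-style pagination usually does. The names walk_pages, is_disabled and NEXT_LI_XPATH are hypothetical, the XPath is loosened with contains() so it still matches when extra classes are present, and the scroll uses scrollIntoView (as in another answer here) instead of ActionChains so that the sketch needs no Selenium imports. The string "xpath" is the value of By.XPATH.

```python
import time

# Assumption: the pagination <li> for "next" carries these classes and gets
# an extra "disabled" class on the last page. Check this in the inspector.
NEXT_LI_XPATH = '//li[contains(@class, "paginate_button") and contains(@class, "next")]'


def is_disabled(class_attr):
    """Return True if the class attribute contains the token 'disabled'."""
    return "disabled" in (class_attr or "").split()


def walk_pages(driver, scrape_page, max_pages=100, pause=0.5):
    """Call scrape_page(driver) on each page and click 'next' in between.

    scrape_page must return a list of rows for the current page.
    max_pages is a safety cap in case the disabled state is never seen.
    """
    rows = []
    for _ in range(max_pages):
        rows.extend(scrape_page(driver))
        next_li = driver.find_element("xpath", NEXT_LI_XPATH)
        if is_disabled(next_li.get_attribute("class")):
            break  # last page reached
        link = next_li.find_element("xpath", ".//a")
        driver.execute_script("arguments[0].scrollIntoView(true);", link)
        time.sleep(pause)  # let the scroll settle before clicking
        link.click()
    return rows
```

Here scrape_page would be your own function that reads the current table, for example one that returns the text of every element matching div.playerTag span:nth-of-type(2).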

Upvotes: 1
