Reputation: 747
I would like to scrape all the statistics on the page https://fantasy.premierleague.com/statistics using the code below:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pandas as pd
import time

options = Options()
options.add_argument("--start-maximized")

start = time.process_time()
time.sleep(3)
date = time.strftime('%Y-%m-%d')  # date stamp used in the output filenames
s = Service(path)  # path points to the chromedriver executable
driver = webdriver.Chrome(options=options, service=s)

# go to page
driver.get('https://fantasy.premierleague.com/statistics')
wait = WebDriverWait(driver, 2)

# accept cookies
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[2]/div/div/div[1]/div[5]/button[1]'))).click()
except TimeoutException:
    pass

for i in range(1, 22):
    # extract the table on the current page
    content = driver.page_source
    soup = BeautifulSoup(content, features="html.parser")
    table = soup.find_all('table', attrs={'class': 'Table-ziussd-1 ElementTable-sc-1v08od9-0 dUELIG OZmJL'})
    df = pd.read_html(str(table))[0]
    df.drop(columns=df.columns[0], axis=1, inplace=True)
    df.to_parquet(f'table_pg_{i}_{date}.parquet.gzip')
    pd.read_parquet(f'table_pg_{i}_{date}.parquet.gzip')
    # scroll down to get the page data below the first scroll
    driver.execute_script("window.scrollBy(0,2000);")
    # driver.execute_script("window.scrollTo(50, document.body.scrollHeight);")
    try:
        # click on the next button
        wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div[2]/div/div[1]/div[3]/button[3]/svg'))).click()
    except TimeoutException:
        pass

print('Execution Time: ', time.process_time() - start)  # note: process_time() is CPU time, not wall-clock

# check the last table
pd.read_parquet(f'table_pg_{i}_{date}.parquet.gzip')
The code runs without any error message, but the last table scraped should be the one on page 21, as in the screenshot below, whereas my parquet file contains the results from the first page.
Both the Previous and Next buttons have the same CSS selectors.
Upvotes: 2
Views: 68
Reputation: 10460
You surely have a reason to use Selenium, but just in case, here is a lower-overhead solution that avoids it entirely:
import requests
import pandas as pd

# the statistics page loads its data from this API endpoint
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
r = requests.get(url)

# the 'elements' key holds one record per player
df = pd.DataFrame(r.json()['elements'])
df.sort_values(by=['total_points'], inplace=True, ascending=False)
print(df[['web_name', 'now_cost', 'form', 'total_points']])
Result:
     web_name  now_cost  form  total_points
393   Haaland       119  11.2            67
91      Toney        71   7.5            45
538      Kane       114   6.7            40
259  Mitrović        68   6.5            39
314   Rodrigo        64   6.3            38
..        ...       ...   ...           ...
243    Garner        45   0.0             0
0      Cédric        42   0.0             0
86     Senesi        45  -0.2            -1
412      Shaw        47  -0.2            -1
410   Maguire        47  -0.2            -1

[624 rows x 4 columns]
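If you still want per-day parquet files as in your original code, you can write this frame out directly; a minimal sketch, with an illustrative filename:
import time
df.to_parquet(f"fpl_stats_{time.strftime('%Y-%m-%d')}.parquet.gzip", compression='gzip')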
The data in that webpage is pulled dynamically from an API endpoint; you can see this in the Dev tools Network tab. By scraping that endpoint you get a fairly large JSON object, which you can dissect to extract the table visible on the page, and other data as well if you are so inclined (just inspect it).
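For example, a quick way to see which other sections the endpoint exposes (a minimal sketch; the exact key names are whatever the API returns and may change over time):
import requests
data = requests.get('https://fantasy.premierleague.com/api/bootstrap-static/').json()
# list the top-level sections of the JSON object, e.g. 'events', 'teams', 'elements', ...
print(list(data.keys()))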
This is the Python Requests documentation: https://requests.readthedocs.io/en/latest/
Upvotes: 1
Reputation: 33361
Try changing this command from
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div[2]/div/div[1]/div[3]/button[3]/svg'))).click()
to
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button svg[class*='ChevronRight']"))).click()
Also make sure you are scrolling to the bottom so that this element becomes clickable.
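A minimal sketch of those two steps together, reusing the driver and wait objects from your code (the 'ChevronRight' class fragment comes from the page's generated styled-components markup and may change over time):
# scroll to the bottom so the Next button is in view and clickable
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# click the Next button by matching the chevron icon's class fragment
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button svg[class*='ChevronRight']"))).click()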
Upvotes: 0