Reputation: 488
I'm trying to scrape a few things from this fantasy basketball page. I'm using BeautifulSoup in Python 3.5+ to do this.
import requests
from bs4 import BeautifulSoup
source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')
To begin with, I'd like to scrape the titles for the 9 categories into a Python list. So my list should look like categories = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS'].
What I hoped to do is something like the following:
tableSubHead = soup.find_all('tr', class_='Table2__header-row')
tableSubHead = tableSubHead[0]
listCats = tableSubHead.find_all('th')
categories = []
for cat in listCats:
    if 'title' in cat.attrs:
        categories.append(cat.string)
However, soup.find_all('tr', class_='Table2__header-row') returns an empty list instead of the table row element I want. I suspect this is because the page source I get is completely different from what Inspect Element shows in Chrome Dev Tools. I understand this is because JavaScript changes the elements on the page dynamically, but I'm not sure what the solution would be.
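One way to check that suspicion is to count the matching rows in the raw HTML and compare with what the browser shows. The following is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, the names ClassCounter and count_elements are made up for illustration, and both HTML strings are hypothetical stand-ins rather than real ESPN markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags of one type whose class attribute includes a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements(html, tag, class_name):
    """Return how many <tag> elements in html carry class_name."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins: an empty app shell (what requests might receive)
# and a rendered fragment (what the browser might show after scripts run).
raw_html = '<html><body><div id="app"></div><script src="main.js"></script></body></html>'
rendered_html = (
    '<html><body><table><tr class="Table2__header-row Table2__odd">'
    '<th title="Field Goal Percentage">FG%</th></tr></table></body></html>'
)

print(count_elements(raw_html, "tr", "Table2__header-row"))       # 0
print(count_elements(rendered_html, "tr", "Table2__header-row"))  # 1
```

If the count is 0 for source_code.text but the row is visible in Dev Tools, the table is almost certainly being built by JavaScript after the initial response.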
Upvotes: 4
Views: 3239
Reputation: 5958
The problem you're facing is that this website is a web app, which means JavaScript has to run to generate what you're seeing, and you can't run JavaScript with requests. Here's what I did to get the result with selenium, which opens a headless browser and lets the JavaScript run first by waiting for a period of time:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
# Replaces: source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
# Return from get() immediately instead of waiting for the load event.
# (Selenium 4 style; older versions passed this through desired_capabilities.)
options.page_load_strategy = 'none'
driver = webdriver.Chrome(options=options)
driver.set_window_size(1440, 900)
driver.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
time.sleep(15)  # give the page's JavaScript time to render the tables
plain_text = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(plain_text, 'lxml')
soup.select('.Table2__header-row') # Returns full results.
len(soup.select('.Table2__header-row')) # 8
This approach lets you scrape websites that are built as web apps, which greatly expands what you can do: you can even add commands such as scrolling or clicking to load more content on the fly.
Use pip install selenium to install selenium. It also lets you use Firefox if you prefer that browser.
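The fixed time.sleep(15) in the script above always costs the full 15 seconds, even when the page is ready sooner. A possible refinement is to poll until the expected markup appears. The sketch below is an illustration only: wait_for_marker and FakeDriver are hypothetical names, and FakeDriver merely imitates a driver whose page_source fills in after a few reads so the example runs without a browser.

```python
import time


def wait_for_marker(get_html, marker, timeout=15.0, interval=0.5):
    """Poll get_html() until marker appears in its result.

    Returns the HTML that contains the marker, or raises TimeoutError
    if the marker has not appeared once the timeout has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a browser whose page renders on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<div id='app'></div>"
        return "<tr class='Table2__header-row'></tr>"


fake = FakeDriver()
html = wait_for_marker(
    lambda: fake.page_source, "Table2__header-row", timeout=2.0, interval=0.01
)
print(fake.reads)  # 3
```

With a real driver the call would look like wait_for_marker(lambda: driver.page_source, 'Table2__header-row'). Selenium also ships its own WebDriverWait helper for this kind of explicit wait, which is likely the better choice in practice.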
Upvotes: 4
Reputation: 123
This may not be exactly what you are looking for, but since the page source has nothing useful in it, parsing it won't get you far. However, when loading the scoreboard, the site apparently makes a couple of API calls that most likely have all the data you are looking for.
There's one API call that appears to have all the information you need:
import requests
payload = {"view":["mMatchupScore","mScoreboard","mSettings","mTeam","modular","mNav"]}
r = requests.get("http://fantasy.espn.com/apis/v3/games/fba/seasons/2019/segments/0/leagues/633975", params=payload).json()
# r is the decoded JSON response (a Python dict) with all the data in it
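Two things may help when working with that response. First, requests sends a list value in params as a repeated query parameter, which urlencode with doseq=True reproduces. Second, because the structure of the response is not described in this thread, a small recursive search can help locate fields of interest. This is only a sketch: find_key is a hypothetical helper, and the sample dict is invented for illustration and does not reflect the real ESPN response.

```python
from urllib.parse import urlencode


def find_key(obj, key, path=""):
    """Yield (path, value) for every occurrence of key in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            child = f"{path}.{k}" if path else k
            if k == key:
                yield child, v
            # Keep descending so nested occurrences are found as well.
            yield from find_key(v, key, child)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_key(item, key, f"{path}[{i}]")


# How a list-valued parameter ends up in the query string.
payload = {"view": ["mMatchupScore", "mScoreboard"]}
print(urlencode(payload, doseq=True))  # view=mMatchupScore&view=mScoreboard

# Invented sample data; replace it with the real decoded response (r).
sample = {
    "teams": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    "settings": {"name": "League"},
}
matches = list(find_key(sample, "name"))
print(matches)
# [('teams[0].name', 'A'), ('teams[1].name', 'B'), ('settings.name', 'League')]
```

Running list(find_key(r, 'name')) on the real response, with whichever key you are after, gives the paths to drill into.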
Upvotes: 2