Warren Crasta

Reputation: 488

How to scrape page with BeautifulSoup? Page Source not matching Inspect Element

I'm trying to scrape a few things from this fantasy basketball page. I'm using BeautifulSoup in Python 3.5+ to do this.

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')

To begin with, I'd like to scrape the titles of the 9 categories into a Python list, so that it looks like categories = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS'].

What I hoped to do is something like the following:

# Take the first header row and collect the text of every <th> that has a title attribute.
tableSubHead = soup.find_all('tr', class_='Table2__header-row')
tableSubHead = tableSubHead[0]
listCats = tableSubHead.find_all('th')
categories = []
for cat in listCats:
    if 'title' in cat.attrs:
        categories.append(cat.string)

However, soup.find_all('tr', class_='Table2__header-row') returns an empty list instead of the table row element I want. I suspect this is because the page source is completely different from what Inspect Element shows in Chrome Dev Tools. I understand that JavaScript changes the elements on the page dynamically, but I'm not sure what the solution would be.

Upvotes: 4

Views: 3239

Answers (2)

Rocky Li

Reputation: 5958

The problem is that this website is a web app: JavaScript has to run to generate what you see in the browser, and requests cannot run JavaScript. Here's how I got the result with Selenium, which opens a headless browser and gives the JavaScript time to run by waiting for a fixed period:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

# requests.get() alone returns the page before JavaScript runs, so drive a browser instead.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Selenium 4 style: do not block driver.get() until the page's load event fires.
options.page_load_strategy = 'none'
driver = webdriver.Chrome(options=options)
driver.set_window_size(1440, 900)
driver.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
time.sleep(15)  # crude fixed wait so the page's JavaScript can finish rendering

plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'lxml')

soup.select('.Table2__header-row') # Returns full results.

len(soup.select('.Table2__header-row')) # 8
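Once the header rows are present in the rendered HTML, the loop from the question can be applied to them. Below is a minimal sketch, not a tested scraper for the live page: SAMPLE_HTML is a hand-written stand-in for driver.page_source, extract_categories is a hypothetical helper name, and taking the first header row (as the question does) is an assumption, since the page returns eight of them.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="Table2__header-row">
    <th>RK</th>
    <th title="Field Goal Percentage">FG%</th>
    <th title="Free Throw Percentage">FT%</th>
    <th title="Points">PTS</th>
  </tr>
</table>
"""


def extract_categories(html):
    """Return the text of each <th> with a title attribute in the first header row."""
    soup = BeautifulSoup(html, "html.parser")
    header_row = soup.find("tr", class_="Table2__header-row")
    if header_row is None:
        # Nothing rendered yet (or the class name changed): return an empty list.
        return []
    return [
        th.get_text(strip=True)
        for th in header_row.find_all("th")
        if th.has_attr("title")
    ]


categories = extract_categories(SAMPLE_HTML)
print(categories)
```

If the first row turns out not to be the one with the category headers, the same list comprehension can be run over each row returned by soup.select('.Table2__header-row') until the right one is found.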

Using Selenium this way lets you scrape websites that are built as web apps, and it greatly expands what you can do: you can even issue commands such as scrolling or clicking to load more content on the fly.
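A fixed sleep works but either wastes time or is too short on a slow connection. As a rough sketch of an alternative, Selenium's explicit waits can block until a given element shows up. The helper name fetch_rendered_html and its parameters are my own for illustration, and the code assumes Selenium 4 with a working Chrome installation; I have not run it against this page.

```python
def fetch_rendered_html(url, css_selector, timeout=30):
    """Load url in headless Chrome and return the HTML once css_selector is present."""
    # Imported inside the function so the helper can be defined
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until at least one matching element exists; raises
        # TimeoutException if nothing appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling fetch_rendered_html with the standings URL and '.Table2__header-row' as the selector should return HTML that can be passed straight to BeautifulSoup, provided that class name is still what the page uses.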

Install Selenium with pip install selenium. Selenium can also drive Firefox if you prefer that browser.

Upvotes: 4

katzrkool

Reputation: 123

This may not be exactly what you are looking for, but since the page source contains almost nothing, it isn't really usable for scraping. However, when loading the scoreboard, the site makes a couple of API calls that most likely return all the data you are looking for.

The following API call appears to return all the information you are looking for.

import requests
payload = {"view":["mMatchupScore","mScoreboard","mSettings","mTeam","modular","mNav"]}
r = requests.get("http://fantasy.espn.com/apis/v3/games/fba/seasons/2019/segments/0/leagues/633975", params=payload).json()

# r is a json object with all the data in it
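Since I haven't mapped out the shape of that response, a small helper for exploring an unfamiliar JSON object may help find where the category stats live. This is only a sketch: describe_json is a hypothetical helper, and the sample dictionary is made up for illustration and is not ESPN's actual schema.

```python
def describe_json(obj, max_depth=2, _depth=0):
    """Return indented lines listing keys and value types, down to max_depth levels."""
    lines = []
    pad = "  " * _depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if _depth + 1 < max_depth:
                lines.extend(describe_json(value, max_depth, _depth + 1))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; list items usually share one shape.
        lines.append(f"{pad}[0]: {type(obj[0]).__name__}")
        if _depth + 1 < max_depth:
            lines.extend(describe_json(obj[0], max_depth, _depth + 1))
    return lines


# Made-up data for illustration only; the real response will look different.
sample = {
    "id": 1,
    "teams": [{"abbrev": "AAA", "points": 10.5}],
    "settings": {"name": "x"},
}
for line in describe_json(sample):
    print(line)
```

Running describe_json(r) on the real response, and raising max_depth gradually, gives an outline of the data without printing the whole payload.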

Upvotes: 2
