JBL

Reputation: 11

Beautiful Soup Not Getting NBA.com Data

I want to extract the data from the table on this webpage: http://stats.nba.com/league/team/#!/advanced/ . Unfortunately, the following code prints nothing, because the soup (see below) contains no "td" elements, even though inspecting the page in the browser shows many of them.

On the other hand, running the same code for the website "http://espn.go.com/nba/statistics/team/_/stat/offense-per-game" does give me what I want.

Why does the code work for one site and not the other, and is there anything I can do to get the data I want from the first site?


import requests
from bs4 import BeautifulSoup

url = "http://stats.nba.com/league/team/#!/advanced/"
r = requests.get(url)
# name the parser explicitly so bs4 does not have to guess one
soupNBAadv = BeautifulSoup(r.content, "html.parser")

tds = soupNBAadv.find_all("td")
for i in tds:
    print(i.text)

Upvotes: 1

Views: 1315

Answers (2)

alecxe

Reputation: 474001

You don't need BeautifulSoup here at all. The table you see in the browser is built from an additional GET request to an endpoint that returns a JSON response. Simulate that request:

import requests

url = "http://stats.nba.com/league/team/#!/advanced/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}

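with requests.Session() as session:
    # send the browser-like User-Agent with every request in this session
    session.headers.update(headers)
    # load the page first, as the browser does before it requests the data
    session.get(url)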

    params = {
        'DateFrom': '',
        'DateTo': '',
        'GameScope': '',
        'GameSegment': '',
        'LastNGames': '0',
        'LeagueID': '00',
        'Location': '',
        'MeasureType': 'Advanced',
        'Month': '0',
        'OpponentTeamID': '0',
        'Outcome': '',
        'PaceAdjust': 'N',
        'PerMode': 'Totals',
        'Period': '0',
        'PlayerExperience': '',
        'PlayerPosition': '',
        'PlusMinus': 'N',
        'Rank': 'N',
        'Season': '2014-15',
        'SeasonSegment': '',
        'SeasonType': 'Regular Season',
        'StarterBench': '',
        'VsConference': '',
        'VsDivision': ''
    }

    response = session.get('http://stats.nba.com/stats/leaguedashteamstats', params=params)
    results = response.json()
    # use a separate name so the request headers dict above is not overwritten
    column_names = results['resultSets'][0]['headers']
    rows = results['resultSets'][0]['rowSet']
    for row in rows:
        print(dict(zip(column_names, row)))

Prints:

{u'MIN': 2074.0, u'TEAM_ID': 1610612737, u'TEAM_NAME': u'Atlanta Hawks', u'AST_PCT': 0.687, u'CFPARAMS': u'Atlanta Hawks', u'EFG_PCT': 0.531, u'DEF_RATING': 99.4, u'NET_RATING': 7.5, u'PIE': 0.556, u'AST_TO': 1.81, u'TS_PCT': 0.57, u'GP': 43, u'L': 8, u'OREB_PCT': 0.21, u'REB_PCT': 0.488, u'W': 35, u'W_PCT': 0.814, u'DREB_PCT': 0.743, u'CFID': 10, u'PACE': 96.17, u'TM_TOV_PCT': 0.149, u'AST_RATIO': 19.9, u'OFF_RATING': 106.9}
{u'MIN': 1897.0, u'TEAM_ID': 1610612738, u'TEAM_NAME': u'Boston Celtics', u'AST_PCT': 0.635, u'CFPARAMS': u'Boston Celtics', u'EFG_PCT': 0.494, u'DEF_RATING': 104.0, u'NET_RATING': -2.7, u'PIE': 0.489, u'AST_TO': 1.73, u'TS_PCT': 0.527, u'GP': 39, u'L': 26, u'OREB_PCT': 0.245, u'REB_PCT': 0.496, u'W': 13, u'W_PCT': 0.333, u'DREB_PCT': 0.747, u'CFID': 10, u'PACE': 99.12, u'TM_TOV_PCT': 0.145, u'AST_RATIO': 18.5, u'OFF_RATING': 101.3}
...
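The JSON shape used above (a "resultSets" list whose entries carry "headers" and "rowSet") can be exercised without hitting the network. The following is only a minimal sketch: the helper name result_set_to_records and the hand-made sample_payload are hypothetical, with the sample values copied from the output above.

```python
def result_set_to_records(payload, index=0):
    """Turn one entry of payload['resultSets'] into a list of dicts."""
    result_set = payload["resultSets"][index]
    column_names = result_set["headers"]
    # pair each row's values with the column names
    return [dict(zip(column_names, row)) for row in result_set["rowSet"]]


# Hand-made stand-in for response.json(); not a real API response.
sample_payload = {
    "resultSets": [
        {
            "headers": ["TEAM_NAME", "W", "L"],
            "rowSet": [["Atlanta Hawks", 35, 8], ["Boston Celtics", 13, 26]],
        }
    ]
}

records = result_set_to_records(sample_payload)
# list the teams with the most wins first
for record in sorted(records, key=lambda r: r["W"], reverse=True):
    print(record["TEAM_NAME"], record["W"], record["L"])
```

Keeping the conversion in a small function like this makes it easy to sort or filter the rows before printing them.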

Selenium-based solution:

from pprint import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get('http://stats.nba.com/league/team/#!/advanced/')
wait = WebDriverWait(driver, 5)

# wait for the table to load
table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table-responsive')))

stats = []
headers = [th.text for th in table.find_elements(By.TAG_NAME, 'th')]
# './/' keeps the XPath relative to the table instead of the whole document
for tr in table.find_elements(By.XPATH, './/tr[@data-ng-repeat]'):
    cells = [td.text for td in tr.find_elements(By.TAG_NAME, 'td')]

    stats.append(dict(zip(headers, cells)))

pprint(stats)

driver.quit()

Prints:

[{u'AST Ratio': u'19.8',
  u'AST%': u'68.1',
  u'AST/TO': u'1.84',
  u'DREB%': u'74.3',
  u'DefRtg': u'100.2',
  u'GP': u'51',
  u'MIN': u'2458',
  u'NetRtg': u'7.4',
  u'OREB%': u'21.0',
  u'OffRtg': u'107.7',
  u'PACE': u'96.12',
  u'PIE': u'55.3',
  u'REB%': u'48.8',
  u'TO Ratio': u'14.6',
  u'TS%': u'57.2',
  u'Team': u'Atlanta Hawks',
  u'eFG%': u'53.4'},
  ...
 {u'AST Ratio': u'18.6',
  u'AST%': u'62.8',
  u'AST/TO': u'1.65',
  u'DREB%': u'77.8',
  u'DefRtg': u'100.2',
  u'GP': u'52',
  u'MIN': u'2526',
  u'NetRtg': u'3.5',
  u'OREB%': u'24.9',
  u'OffRtg': u'103.7',
  u'PACE': u'95.75',
  u'PIE': u'53.4',
  u'REB%': u'51.8',
  u'TO Ratio': u'15.4',
  u'TS%': u'54.4',
  u'Team': u'Washington Wizards',
  u'eFG%': u'50.9'}]

Upvotes: 4

salmanwahed

Reputation: 9657

requests.get() returns no data for the first URL because the page fetches its data from the server with an AJAX call. The AJAX URL is http://stats.nba.com/stats/leaguedashteamstats, and you have to pass some parameters with it.
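To make "passing parameters" concrete, here is a rough sketch of how query parameters end up attached to the AJAX URL. The helper build_stats_url is hypothetical, and the three parameters are only an illustrative subset; the endpoint may well require the full set shown in alecxe's answer.

```python
from urllib.parse import urlencode

AJAX_URL = "http://stats.nba.com/stats/leaguedashteamstats"


def build_stats_url(params):
    """Append the URL-encoded query string to the AJAX endpoint."""
    return AJAX_URL + "?" + urlencode(params)


# Illustrative subset only; the real request sends many more parameters.
example_params = {
    "MeasureType": "Advanced",
    "Season": "2014-15",
    "SeasonType": "Regular Season",
}

full_url = build_stats_url(example_params)
print(full_url)
```

In practice requests does this encoding for you when you pass params=... to get(), so building the URL by hand is mainly useful for comparing it with what you see in the browser's network log.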

A requests.get() call only returns what appears in the page source. Press Ctrl+U in your browser to view the page source, and you will see that it contains no data.

In Chrome, open the developer tools and look at the Network tab to see which requests the page makes. In Firefox, you can use Firebug and look at its Net tab.

For the second URL, the page source is already populated with the data (view the page source to check), so you can get it with a plain GET request to that URL.
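The difference between the two kinds of page can be checked programmatically by counting "td" tags in the raw source. This is a minimal sketch using only the standard library; the TagCounter class, the count_tags helper, and both HTML strings are made-up stand-ins, not the actual source of either site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag occurs in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag="td"):
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Stand-in for a page whose table is filled in later by JavaScript.
js_rendered_source = (
    '<html><body><div data-ng-view></div>'
    '<script src="app.js"></script></body></html>'
)
# Stand-in for a page whose table is already present in the source.
server_rendered_source = (
    "<html><body><table><tr><td>Hawks</td><td>35</td></tr></table></body></html>"
)

print(count_tags(js_rendered_source))
print(count_tags(server_rendered_source))
```

Feeding the text of a real requests.get() response to count_tags in the same way would show whether the data is in the source or loaded afterwards.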

alecxe's answer demonstrates how to get the data from the first URL.

Upvotes: 2
