DrDunkenstein

Reputation: 33

Problems Parsing NBA Boxscore Data with BeautifulSoup

I am trying to parse player-level NBA box score data from ESPN. The following is the initial portion of my attempt:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')

It seems that BeautifulSoup is giving me a strange result. The last 'table' in the source code contains the player data, and that is what I want to extract. The page source online shows this table being closed at line 421, which is AFTER both teams' box scores. However, if we look at 'soup', an extra closing tag appears BEFORE the Miami stats, at what is line 350 in the online source code.
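As far as I can tell, the difference comes down to how each parser recovers from malformed HTML. A minimal sketch of the comparison (the snippet below is a stand-in for the ESPN markup, not the real page; lxml and html5lib are optional installs):

```python
from bs4 import BeautifulSoup

# A stand-in snippet with unclosed <td>/<tr> tags, which forces each
# parser to decide on its own where the table really ends.
broken = "<table><tr><td>BOS<tr><td>MIA</table><p>after the table</p>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        tree = BeautifulSoup(broken, parser)
    except Exception:
        continue  # that parser is not installed
    table = tree.find("table")
    print(parser, "recovered:", table.get_text(" ", strip=True))
```

Each parser prints a different reconstruction of the same broken table, which is consistent with the truncation I'm seeing.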

The output from the parser 'html.parser' is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4 T

BOS 25 29 22 31107MIA 31 31 31 27120

Boston Celtics
STARTERS    
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A  
FTM-A
OREB

As you can see, it ends mid-table at 'OREB' and it never makes it to the Miami Heat section. The output using 'lxml' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4T

BOS 25 29 22 31107MIA 31 31 31 27120

This doesn't include the box scores at all. The complete code I'm using (due to Daniel Rodriguez) looks something like:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame

for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)

players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')

A sample of the output I'd like is:

,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26
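For reference, the row-extraction step I'm after can be sketched on a self-contained snippet (the markup below is a simplified stand-in for ESPN's 'mod-data' table, with only two stat columns):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified stand-in for one team's box-score table.
html = """
<table class="mod-data">
  <thead><tr><th>Boston Celtics</th></tr>
  <tr><th>STARTERS</th><th>MIN</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Kevin Garnett, PF</td><td>32</td><td>9</td></tr>
    <tr><td>Paul Pierce, SF</td><td>41</td><td>23</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table", class_="mod-data")
# Stat names come from the second header row, skipping the STARTERS cell.
headers = [th.text for th in table.find("thead").find_all("tr")[1].find_all("th")[1:]]
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.text for td in tr.find_all("td")]
    # Drop the ", PF"-style position suffix from the player name.
    rows.append([cells[0].split(",")[0]] + cells[1:])

frame = pd.DataFrame(rows, columns=["player"] + headers)
print(frame)
```

This works on the stand-in; the problem is only that the real page's table comes back truncated.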

Upvotes: 1

Views: 2053

Answers (3)

Padraic Cunningham

Reputation: 180481

The code returns the correct data using the default parser, which will probably be lxml if you have it installed:

req = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(req.content)
table = soup.find_all('table')
print(table)

....................
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/2009/james-jones">James Jones</a>, SF</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr><tr align="right" class="odd player-46-6490" valign="middle">
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/6490/terrel-harris">Terrel Harris</a>, SG</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr></tbody><thead><tr align="right"><th style="text-align:left;">TOTALS</th><th></th>
<th nowrap="">FGM-A</th>
<th>3PM-A</th>
<th>FTM-A</th>
<th>OREB
</th><th>DREB</th>
<th>REB</th>
<th>AST</th>
<th>STL</th>
<th>BLK</th>
<th>TO</th>
<th>PF</th>
<th> </th>
<th>PTS</th>
</tr></thead><tbody><tr align="right" class="even"><td colspan="2" style="text-align:left"></td><td><strong>43-79</strong></td><td><strong>8-16</strong></td><td><strong>26-32</strong></td><td><strong>5</strong></td><td><strong>31</strong></td><td><strong>36</strong></td><td><strong>25</strong></td><td><strong>8</strong></td><td><strong>5</strong></td><td><strong>8</strong></td><td><strong>20</strong></td><td> </td><td><strong>120</strong></td></tr><tr align="right" class="odd"><td colspan="2" style="text-align:left"><strong></strong></td><td><strong>54.4%</strong></td><td><strong>50.0%</strong></td><td><strong>81.3%</strong></td><td colspan="13"></td></tr><tr bgcolor="#ffffff"><td align="right" colspan="15" style="padding:10px;"><div style="float: right;"><strong>Fast break points:</strong>   12<br/><strong>Points in the paint:</strong>   46<br/><strong>Total Team Turnovers (Points off turnovers):</strong>   8 (6)</div><div style="float: left;">+/- denotes team's net points while the player is on the court.</div></td></tr></tbody></table>]

Using "html.parser" gave me the same truncated output as in your question, but as you can see above, without specifying a parser it works fine.

It works on both Python 2.7 and 3.4 using bs4 4.3.2; my lxml version is 3.3.3.0.

If you don't have the latest bs4, you should update. You can also use the diagnose function, which prints a report showing how the different parsers handle the document and tells you if you're missing a parser that Beautiful Soup could be using.

So with your html, use the following to get a report:

from bs4.diagnose import diagnose
diagnose(request.text)

Using a regex to parse html is well documented as a poor approach; a trivial change to the html and the regex can break.

Upvotes: 0

Jonathan Epstein

Reputation: 379

BeautifulSoup truncated part of the results for me as well, so I replaced the soup.find_all calls with re.findall:

import re
import mechanize  # br here is a mechanize Browser; requests would work too
from bs4 import BeautifulSoup

br = mechanize.Browser()
r = br.open('http://espn.go.com/nba/boxscore?gameId=400277722')
html = r.read()
soup = BeautifulSoup(html)

statnames = re.search('STARTERS</th>.*?PTS</th>',html, re.DOTALL).group()
th = re.findall('th.*</th', statnames) # each th tag contains a statname
names = ['Name', 'Team']
for t in th:
   t = re.sub('.*>','',t)
   t = t.replace('</th','')
   names.append(t)
print names

celts = re.search('Boston Celtics.*?Total Team Turnovers',html,re.DOTALL).group()
heat = re.search('nba-small-mia floatleft.*?Total Team Turnovers',html,re.DOTALL).group()

players = str(soup).split('td nowrap')
for player in players[1:len(players)]:
   try:
       stats = [re.search('[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}',player).group()] 
   except:
       stats = [re.search('[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}',player).group()] # player name
       if stats[0] in celts:
          stats.append('Boston Celtics')
       elif stats[0] in heat:
          stats.append('Miami Heat')
   td = re.findall('td.*?/td', player) # each td tag contains a stat
   for t in td:
       t = re.findall('>.*<',t)
       t = re.sub('.*>','',t[0])
       t = t.replace('<','')
       if t!='' and t!='\xc2\xa0':
          stats.append(t)
   print stats

output =

['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']
['Kevin Garnett', 'Boston Celtics', '32', '4-8', '0-0', '1-1', '1', '11', '12', '2', '0', '2', '5', '4', '-4', '9']
['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5', '11', '1', '0', '0', '1', '2', '-8', '15']
['Paul Pierce', 'Boston Celtics', '41', '6-15', '2-4', '9-9', '0', '5', '5', '5', '2', '0', '0', '3', '-17', '23']
['Rajon Rondo', 'Boston Celtics', '44', '9-14', '0-2', '2-4', '0', '7', '7', '13', '0', '0', '4', '4', '-13', '20']
['Courtney Lee', 'Boston Celtics', '24', '5-6', '1-1', '0-0', '0', '1', '1', '1', '0', '0', '1', '5', '-7', '11']
['Jared Sullinger', 'Boston Celtics', '8', '1-2', '0-0', '0-0', '0', '1', '1', '0', '0', '0', '0', '1', '-3', '2']
['Jeff Green', 'Boston Celtics', '23', '0-4', '0-0', '3-4', '0', '3', '3', '0', '1', '0', '1', '0', '-7', '3']
['Jason Terry', 'Boston Celtics', '25', '2-7', '0-3', '4-4', '0', '0', '0', '1', '1', '0', '3', '3', '-10', '8']
['Leandro Barbosa', 'Boston Celtics', '16', '6-8', '3-3', '1-2', '0', '1', '1', '1', '0', '0', '0', '1', '+4', '16']
['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"]
['Kris Joseph', 'Boston Celtics', "DNP COACH'S DECISION"]
['Jason Collins', 'Boston Celtics', "DNP COACH'S DECISION"]
['Darko Milicic', 'Boston Celtics', "DNP COACH'S DECISION"]
['Shane Battier', 'Miami Heat', '29', '2-4', '2-3', '0-0', '0', '2', '2', '1', '1', '0', '0', '3', '+12', '6']
['LeBron James', 'Miami Heat', '29', '10-16', '2-4', '4-5', '1', '9', '10', '3', '2', '0', '0', '2', '+12', '26']
['Chris Bosh', 'Miami Heat', '37', '8-15', '0-1', '3-4', '2', '8', '10', '1', '0', '3', '1', '3', '+15', '19']
['Mario Chalmers', 'Miami Heat', '36', '3-7', '0-1', '2-2', '0', '1', '1', '11', '3', '0', '1', '3', '+11', '8']
['Dwyane Wade', 'Miami Heat', '35', '10-22', '0-0', '9-11', '2', '1', '3', '4', '2', '1', '4', '3', '-6', '29']
['Udonis Haslem', 'Miami Heat', '11', '0-1', '0-0', '0-0', '0', '3', '3', '0', '0', '0', '1', '1', '-2', '0']
['Rashard Lewis', 'Miami Heat', '19', '4-5', '1-2', '1-2', '0', '5', '5', '1', '0', '1', '0', '1', '+1', '10']
['Norris Cole', 'Miami Heat', '6', '1-2', '1-2', '0-0', '0', '0', '0', '1', '0', '0', '1', '2', '+5', '3']
['Ray Allen', 'Miami Heat', '31', '5-7', '2-3', '7-8', '0', '2', '2', '2', '0', '0', '0', '1', '+9', '19']
['Mike Miller', 'Miami Heat', '7', '0-0', '0-0', '0-0', '0', '0', '0', '1', '0', '0', '0', '1', '+8', '0']
['Josh Harrellson', 'Miami Heat', "DNP COACH'S DECISION"]
['James Jones', 'Miami Heat', "DNP COACH'S DECISION"]
['Terrel Harris', 'Miami Heat', "DNP COACH'S DECISION"]

To catch names like D.J. Augustine, the simplest (though not the most concise) code is:

try:
    stats = [re.search('[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}',player).group()] 
except:
    stats = [re.search('[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}',player).group()]

Upvotes: 3

Gerard Rozsavolgyi

Reputation: 5074

Try using a different parser (lxml):

soup = BeautifulSoup(request.text,'lxml')
tables = soup.find_all('table')

for t in tables:
    print t.text

It will detect the page structure better.
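If lxml still drops rows, html5lib is worth trying too; it implements the browser's own error-recovery algorithm, so its tree is usually the closest match to what "view source" suggests. A sketch with an inline stand-in for request.text (html5lib is an optional install, hence the fallback):

```python
from bs4 import BeautifulSoup

markup = "<table><tr><td>BOS<td>MIA</table>"  # stand-in for request.text

try:
    soup = BeautifulSoup(markup, "html5lib")      # browser-grade recovery
except Exception:
    soup = BeautifulSoup(markup, "html.parser")   # fallback if not installed

for t in soup.find_all("table"):
    print(t.get_text(" ", strip=True))
```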

Upvotes: 0
