BSHuniversity

Reputation: 368

Beautiful Soup strategy for splitting data

I want to parse through a webpage like this and gather only the names of starters:

http://espn.go.com/nba/boxscore?gameId=400827888

My script grabs all the names on the page, but I cannot tell where the bench players for the team on the top (in this case Detroit) end and where the starters for the team on the bottom (in this case Atlanta) begin. The real problem is that the top team can have anywhere from 11 to 15 players on its listed roster, so as far as I understand I can't just split at a fixed index.

As written, this gives me the first 10 names of the Pistons rather than the first five of the Pistons and the first five of the Hawks. One strategy I thought of relies on the team logos, but that seems very tricky given the way they are coded in the HTML.

def parse_boxscore(url):
    """Gathers names of starters from both teams, stores in list"""
    soup = make_soup(url)
    starters = [td for td in soup.findAll("td", "name")]
    return starters[0:5], starters[6:11]

Can anyone think of a consistent strategy? I'm not very technically savvy so I will sacrifice relative efficiency for simplicity (I know, I know)...

Upvotes: 1

Views: 236

Answers (2)

Padraic Cunningham

Reputation: 180401

If all you want are the starters, it is pretty straightforward: pull the first tbody inside each div.content.hide-bench and extract the text from the td.name tags:

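from pprint import pprint as pp

import requests
from bs4 import BeautifulSoup

teams = {}
page = requests.get('http://espn.go.com/nba/boxscore?gameId=400827888')

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(page.content, "html.parser")

# Each team's box score sits in its own div.content.hide-bench container.
for table in soup.select("div.content.hide-bench"):
    # The first text node of the caption is the team name.
    team = table.select_one("div.table-caption").find(text=True)
    # The first tbody holds the starters, one row per player.
    teams[team] = [tr.select_one("td.name").text for tr in table.find("tbody").find_all("tr")]

pp(teams)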

Which gives you:

{'Hawks': ['P. MillsapPF',
           'K. BazemoreSF',
           'A. HorfordC',
           'J. TeaguePG',
           'K. KorverSG'],
 'Pistons': ['M. MorrisPF',
             'E. IlyasovaPF',
             'A. DrummondC',
             'R. JacksonPG',
             'K. Caldwell-PopeSG']}
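Note that the position code is fused onto the end of each name in that output. If you want the name on its own, one possible approach is to peel a known position code off the end of the string. This is only a sketch: `split_name_position` is a hypothetical helper, and the list of codes (PG, SG, SF, PF, C, plus G and F as assumed extras) is a guess based on the output above, not something taken from the page markup.

```python
import re

# Position codes seen in the output above; G and F are assumed extras.
# The lazy name group makes the longest matching code win (PF before F).
_POSITION = re.compile(r"^(?P<name>.+?)(?P<pos>PG|SG|SF|PF|C|G|F)$")


def split_name_position(cell_text):
    """Split 'P. MillsapPF' into ('P. Millsap', 'PF').

    Returns (text, None) when no known position code is found at the end.
    """
    text = cell_text.strip()
    match = _POSITION.match(text)
    if match is None:
        return text, None
    return match.group("name").strip(), match.group("pos")


hawks = ['P. MillsapPF', 'K. BazemoreSF', 'A. HorfordC', 'J. TeaguePG', 'K. KorverSG']
names = [split_name_position(player)[0] for player in hawks]
print(names)
```

A name that genuinely ends in one of those capital letters would be split incorrectly, so if the page wraps the position in its own tag, reading that tag directly would be more reliable.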

Upvotes: 1

maxymoo

Reputation: 36545

If you use pandas instead of Beautiful Soup, it will parse the tables out separately. It only gets the starters, not the bench players, so hopefully that isn't an issue.

import pandas as pd 
pd.read_html('http://www.espn.com.au/nba/boxscore?gameId=400827888')

[  Unnamed: 0   1   2   3   4    T
 0        DET  25  23  34  24  106
 1        ATL  25  18  23  28   94,
              starters  MIN    FG  3PT    FT  OREB  DREB  REB  AST  STL  BLK  \
 0         M. MorrisPF   37  6-19  1-4   5-6     5     5   10    4    0    0
 1       E. IlyasovaPF   34  6-12  3-6   1-2     3     4    7    3    0    1
 2        A. DrummondC   37  6-16  0-0  6-10     8    11   19    3    1    2
 3        R. JacksonPG   32  4-10  2-4   5-5     1     7    8    5    2    0
 4  K. Caldwell-PopeSG   37  7-14  4-7   3-3     1     3    4    1    1    0

    TO  PF  +/-  PTS
 0   0   1   17   18
 1   3   4   20   16
 2   2   4   23   18
 3   2   0   26   15
 4   2   1   17   21  ,
         starters  MIN    FG  3PT   FT  OREB  DREB  REB  AST  STL  BLK  TO  PF  \
 0   P. MillsapPF   36  7-15  2-6  3-4     1     7    8    3    0    0   2   4
 1  K. BazemoreSF   21   0-3  0-1  0-0     0     7    7    1    0    0   4   3
 2    A. HorfordC   30  6-11  1-3  2-3     1     3    4    4    2    3   1   1
 3    J. TeaguePG   32  7-16  1-3  3-4     0     2    2    4    0    0   5   1
 4    K. KorverSG   29   3-9  1-5  0-0     0     2    2    1    1    0   1   4

    +/-  PTS
 0  -22   19
 1  -17    0
 2   -5   15
 3  -23   18
 4   -9    7  ,
         TEAM   W   L    PCT  GB STRK
 0  Cleveland  57  25  0.695   0   L1
 1    Indiana  45  37  0.549  12   W3
 2    Detroit  44  38  0.537  13   W1
 3    Chicago  42  40  0.512  15   W3
 4  Milwaukee  33  49  0.402  24   L2,
          TEAM   W   L    PCT  GB STRK
 0       Miami  48  34  0.585   0   L1
 1     Atlanta  48  34  0.585   0   L2
 2   Charlotte  48  34  0.585   0   W2
 3  Washington  41  41  0.500   7   W3
 4     Orlando  35  47  0.427  13   L1]
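Since the result is a plain list of DataFrames, you still need to pick out the two starter tables from it. A minimal sketch, assuming the starter tables are the ones whose first column is headed `starters` (as in the output above) and that the first table is the line score listing the teams in the same order. The DataFrames below are hand-built, abbreviated stand-ins for the real result, and `starter_tables` is a hypothetical helper name.

```python
import pandas as pd


def starter_tables(tables):
    """Keep only the tables whose first column header is 'starters'."""
    return [
        t for t in tables
        if len(t.columns) > 0 and str(t.columns[0]).lower() == "starters"
    ]


# Abbreviated stand-ins for the list returned by pd.read_html.
tables = [
    pd.DataFrame({"Unnamed: 0": ["DET", "ATL"], "T": [106, 94]}),
    pd.DataFrame({"starters": ["M. MorrisPF", "E. IlyasovaPF"], "PTS": [18, 16]}),
    pd.DataFrame({"starters": ["P. MillsapPF", "K. BazemoreSF"], "PTS": [19, 0]}),
    pd.DataFrame({"TEAM": ["Cleveland", "Indiana"], "W": [57, 45]}),
]

# Assumption: the line score (first table) lists teams in box score order.
team_codes = tables[0].iloc[:, 0].tolist()
starters = {
    code: table["starters"].tolist()
    for code, table in zip(team_codes, starter_tables(tables))
}
print(starters)
```

Filtering on the column header rather than on list positions should keep working if the page adds or removes unrelated tables such as the standings.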

Upvotes: 0
