Reputation: 368
I want to parse through a webpage like this and gather only the names of starters:
http://espn.go.com/nba/boxscore?gameId=400827888
My script grabs all the names on the page, but I cannot tell where the bench players for the team on top (in this case Detroit) end and where the starters for the team on the bottom (in this case Atlanta) begin. The real problem is that the top team can have anywhere from 11 to 15 players on its listed roster, so as far as I understand I can't just split the list at a fixed index.
As written, this gives me the first 10 names of the Pistons, not the first five of the Pistons and the first five of the Hawks. One strategy I thought of relies on the team logos, but that seems very tricky given the way they are coded in the HTML.
def parse_boxscore(url):
    """Gathers names of starters from both teams, stores in list"""
    soup = make_soup(url)
    starters = [td for td in soup.findAll("td", "name")]
    return starters[0:5], starters[6:11]
Can anyone think of a consistent strategy? I'm not very technically savvy so I will sacrifice relative efficiency for simplicity (I know, I know)...
Upvotes: 1
Views: 236
Reputation: 180401
If all you want are the starters, it is pretty straightforward: pull the first tbody inside each div.content.hide-bench and extract the text from the td.name tags:
import requests
from bs4 import BeautifulSoup
from pprint import pprint as pp

teams = {}
page = requests.get('http://espn.go.com/nba/boxscore?gameId=400827888')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(page.content, "html.parser")
# Each div.content.hide-bench holds one team's box score.
for table in soup.select("div.content.hide-bench"):
    team = table.select_one("div.table-caption").find(text=True)
    # The first tbody contains only the starters' rows.
    teams[team] = [tr.select_one("td.name").text for tr in table.find("tbody").find_all("tr")]
pp(teams)
Which gives you:
{'Hawks': ['P. MillsapPF',
           'K. BazemoreSF',
           'A. HorfordC',
           'J. TeaguePG',
           'K. KorverSG'],
 'Pistons': ['M. MorrisPF',
             'E. IlyasovaPF',
             'A. DrummondC',
             'R. JacksonPG',
             'K. Caldwell-PopeSG']}
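Note that each entry in the output above has the position code glued onto the name (for example 'P. MillsapPF'). If only the names are wanted, one option is to strip a trailing position code with a regular expression. This is a minimal sketch, not part of the original answer: the helper name split_name_position and the list of position codes (PG, SG, SF, PF, C, G, F) are assumptions, and a surname that itself ends in one of those capital letters would be split incorrectly.

```python
import re

# Assumed set of position codes that can trail a player's name.
_POSITION = re.compile(r"^(?P<name>.+?)(?P<pos>PG|SG|SF|PF|C|G|F)$")


def split_name_position(cell_text):
    """Split 'P. MillsapPF' into ('P. Millsap', 'PF').

    Returns (cell_text, None) when no known position code is found.
    """
    text = cell_text.strip()
    match = _POSITION.match(text)
    if match is None:
        return text, None
    return match.group("name"), match.group("pos")


starters = ['P. MillsapPF', 'K. BazemoreSF', 'A. HorfordC']
# Keep only the name part of each (name, position) pair.
names = [split_name_position(s)[0] for s in starters]
```

Applied to the teams dictionary, this could be used as {team: [split_name_position(p)[0] for p in players] for team, players in teams.items()}.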
Upvotes: 1
Reputation: 36545
If you use pandas instead of Beautiful Soup, it will parse the tables out separately: pd.read_html returns a list of DataFrames, one per table. It only gets the starters, not the bench players, so hopefully that isn't an issue.
import pandas as pd
pd.read_html('http://www.espn.com.au/nba/boxscore?gameId=400827888')
[   Unnamed: 0   1   2   3   4    T
0         DET  25  23  34  24  106
1         ATL  25  18  23  28   94,
              starters  MIN    FG  3PT    FT  OREB  DREB  REB  AST  STL  BLK  \
0         M. MorrisPF   37  6-19  1-4   5-6     5     5   10    4    0    0
1       E. IlyasovaPF   34  6-12  3-6   1-2     3     4    7    3    0    1
2        A. DrummondC   37  6-16  0-0  6-10     8    11   19    3    1    2
3        R. JacksonPG   32  4-10  2-4   5-5     1     7    8    5    2    0
4  K. Caldwell-PopeSG   37  7-14  4-7   3-3     1     3    4    1    1    0

   TO  PF  +/-  PTS
0   0   1   17   18
1   3   4   20   16
2   2   4   23   18
3   2   0   26   15
4   2   1   17   21  ,
         starters  MIN    FG  3PT   FT  OREB  DREB  REB  AST  STL  BLK  TO  PF  \
0   P. MillsapPF   36  7-15  2-6  3-4     1     7    8    3    0    0   2   4
1  K. BazemoreSF   21   0-3  0-1  0-0     0     7    7    1    0    0   4   3
2    A. HorfordC   30  6-11  1-3  2-3     1     3    4    4    2    3   1   1
3    J. TeaguePG   32  7-16  1-3  3-4     0     2    2    4    0    0   5   1
4    K. KorverSG   29   3-9  1-5  0-0     0     2    2    1    1    0   1   4

   +/-  PTS
0  -22   19
1  -17    0
2   -5   15
3  -23   18
4   -9    7  ,
         TEAM   W   L    PCT  GB  STRK
0  Cleveland  57  25  0.695   0    L1
1    Indiana  45  37  0.549  12    W3
2    Detroit  44  38  0.537  13    W1
3    Chicago  42  40  0.512  15    W3
4  Milwaukee  33  49  0.402  24    L2,
          TEAM   W   L    PCT  GB  STRK
0       Miami  48  34  0.585   0    L1
1     Atlanta  48  34  0.585   0    L2
2   Charlotte  48  34  0.585   0    W2
3  Washington  41  41  0.500   7    W3
4     Orlando  35  47  0.427  13    L1]
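Since the list above also contains the line score and the standings tables, the two starter tables still have to be picked out. A minimal sketch of one way to do that, assuming the box score tables are the only ones with a column labelled "starters" (as in the output above; the page layout may have changed since). The helper name starter_names is hypothetical.

```python
import pandas as pd


def starter_names(tables, column="starters"):
    """Return one list of names for each table that has a `column` column.

    `tables` is a list of DataFrames, such as the one pd.read_html returns.
    Tables without that column (line score, standings) are skipped.
    """
    result = []
    for df in tables:
        if column in df.columns:
            result.append(df[column].astype(str).tolist())
    return result
```

With the list shown above, starter_names(pd.read_html(url)) should give two lists in page order, so the first would be the top team (Detroit) and the second the bottom team (Atlanta).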
Upvotes: 0