Reputation: 434
I am trying to scrape some data from a website and am having a hard time. I can get the player names, but that's about it so far; I've tried different things and keep coming up short. Below is a sample of the HTML I'm trying to parse, followed by my Python script. Note that there are two tables (one for each team), and the class of each player row alternates between "odd" and "even". I have labeled the parts I want. I am using Python 2.7.
`<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
<thead class="nbaGIClippers">
<tr>
<th colspan="17">Los Angeles Clippers (1-0)</th> <!-- I want team name -->
</tr>
</thead>
<tbody><tr colspan="17">
<td colspan="17" class="nbaGIBoxCat"><span>field goals</span><span>rebounds</span></td>
</tr>
<tr>
<td class="nbaGITeamHdrStatsNoBord" colspan="1"> </td>
<td class="nbaGITeamHdrStats">pos</td>
<td class="nbaGITeamHdrStats">min</td>
<td class="nbaGITeamHdrStats">fgm-a</td>
<td class="nbaGITeamHdrStats">3pm-a</td>
<td class="nbaGITeamHdrStats">ftm-a</td>
<td class="nbaGITeamHdrStats">+/-</td>
<td class="nbaGITeamHdrStats">off</td>
<td class="nbaGITeamHdrStats">def</td>
<td class="nbaGITeamHdrStats">tot</td>
<td class="nbaGITeamHdrStats">ast</td>
<td class="nbaGITeamHdrStats">pf</td>
<td class="nbaGITeamHdrStats">st</td>
<td class="nbaGITeamHdrStats">to</td>
<td class="nbaGITeamHdrStats">bs</td>
<td class="nbaGITeamHdrStats">ba</td>
<td class="nbaGITeamHdrStats">pts</td>
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td> <!-- I want player name -->
<td class="nbaGIPosition">F</td> <!-- I want position name -->
<td>14:16</td> <!-- I want this -->
<td>1-4</td> <!-- I want this -->
<td>1-2</td> <!-- I want this -->
<td>2-2</td> <!-- I want this -->
<td>+12</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>3</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
</tr>
<tr class="even">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td> <!-- I want this -->
<td class="nbaGIPosition">F</td> <!-- I want this -->
<td>26:19</td> <!-- I want this -->
<td>5-14</td> <!-- I want this -->
<td>0-1</td> <!-- I want this -->
<td>1-1</td> <!-- I want this -->
<td>+14</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>11</td> <!-- I want this -->
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td> <!-- I want this -->
<td class="nbaGIPosition">C</td> <!-- I want this -->
<td>26:27</td> <!-- I want this -->
<td>6-7</td> <!-- I want this -->
<td>0-0</td> <!-- I want this -->
<td>3-5</td> <!-- I want this -->
<td>+19</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>11</td> <!-- I want this -->
<td>12</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>3</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>15</td> <!-- I want this -->
</tr>
<!-- And so on; the row class keeps alternating between odd and even -->
<!-- Also note there are two tables, one for each team -->
<!-- This is the table id: <table id="nbaGITeamStats" cellpadding="0" cellspacing="0"> -->`
That was long, but I wanted to show how the classes alternate. Here is my Python script. I plan to use a dictionary to save the data once I actually scrape it successfully.
import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    for tr in soup.find_all('table id="nbaGITeamStats'):
        tds = tr.find_all('td')
        print tds
Upvotes: 3
Views: 6056
Reputation: 434
Here is what I did to get everything to come up. I'll still have to clean up the code, and this was with great help from sal.
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rowsodd = table.find_all(attrs={'class':'odd'})
        rowseven = table.find_all(attrs={'class':'even'})
        for player in rowsodd:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers
        for player in rowseven:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers
Everything shows up now. I will have to clean it up a bit more, but the data is a lot cleaner. As you can tell from the question, I had never used Beautiful Soup before. I needed two row lists (odd and even); perhaps someone knows a better way, but this was the easiest way for me to get the data I was looking for. I'm always looking to improve, and I hope someone else learns from this.
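If the two separate loops feel repetitive, one possible simplification is to pass both class names to a single `find_all` call. bs4 treats a list as "match any of these" and returns the rows in page order. This is only a sketch and has not been run against the live page: the markup is a trimmed copy of the sample from the question, and `player_rows` is a helper name made up for this example. The `print(...)` form is used so the snippet runs on both Python 2.7 and Python 3.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question (assumption: the real
# page has the same structure, just with more cells per row).
SAMPLE_HTML = """
<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
<thead class="nbaGIClippers">
<tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr>
</thead>
<tbody>
<tr><td class="nbaGITeamHdrStats">pos</td><td class="nbaGITeamHdrStats">min</td></tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td>
<td class="nbaGIPosition">F</td><td>14:16</td>
</tr>
<tr class="even">
<td id="nbaGIBoxNme" class="b"><a href="#">B. Griffin</a></td>
<td class="nbaGIPosition">F</td><td>26:19</td>
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="#">D. Jordan</a></td>
<td class="nbaGIPosition">C</td><td>26:27</td>
</tr>
</tbody>
</table>
"""


def player_rows(table):
    """Return the odd and even player rows of one table, in page order."""
    # A list passed to class_ matches a row whose class equals any item.
    return table.find_all("tr", class_=["odd", "even"])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = []
for table in soup.find_all("table", id="nbaGITeamStats"):
    for row in player_rows(table):
        name_cell = row.find("td", id="nbaGIBoxNme")
        names.append(name_cell.get_text(strip=True))

print(names)
```

Because the rows come back in page order, the players are listed as they appear in the box score rather than all odd rows followed by all even rows.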
Upvotes: 1
Reputation: 2849
The correct way to write it is:
for tr in soup.find_all('table', id='nbaGITeamStats'):
That works fine for me (python 3.4):
>>> import requests
>>> from bs4 import BeautifulSoup
>>> gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
>>>
>>> for game in gamesForDay:
...     url = "http://www.nba.com/"+game
...     page = requests.get(url).content
...     soup = BeautifulSoup(page, 'html.parser')
...     for tr in soup.find_all('table', id='nbaGITeamStats'):
...         tds = tr.find_all('td')
...         print(tds)
To access the content inside a td tag, use .text, like this:
for td in tds:
    print(td.text)
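Since the question mentions storing the results in a dictionary, here is one way that could look: take the column names from the header cells and zip them with the text of each player row. This is a sketch under assumptions, not a tested scraper. It assumes every player row has exactly one cell per header after the name cell, the markup is a trimmed copy of the sample from the question, and `table_to_dict` is a name invented for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question (only four stat columns).
SAMPLE_HTML = """
<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
<thead class="nbaGIClippers">
<tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr>
</thead>
<tbody>
<tr>
<td class="nbaGITeamHdrStatsNoBord" colspan="1"> </td>
<td class="nbaGITeamHdrStats">pos</td>
<td class="nbaGITeamHdrStats">min</td>
<td class="nbaGITeamHdrStats">fgm-a</td>
<td class="nbaGITeamHdrStats">pts</td>
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td>
<td class="nbaGIPosition">F</td><td>14:16</td><td>1-4</td><td>5</td>
</tr>
<tr class="even">
<td id="nbaGIBoxNme" class="b"><a href="#">B. Griffin</a></td>
<td class="nbaGIPosition">F</td><td>26:19</td><td>5-14</td><td>11</td>
</tr>
</tbody>
</table>
"""


def table_to_dict(table):
    """Map each player name to a {column name: cell text} dictionary."""
    # Header cells carry the class nbaGITeamHdrStats; the blank first cell
    # uses a different class, so it is not picked up here.
    headers = [td.get_text(strip=True)
               for td in table.find_all("td", class_="nbaGITeamHdrStats")]
    players = {}
    for row in table.find_all("tr", class_=["odd", "even"]):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # cells[0] is the player name; the remaining cells line up with headers.
        players[cells[0]] = dict(zip(headers, cells[1:]))
    return players


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stats = {}
for table in soup.find_all("table", id="nbaGITeamStats"):
    team_name = table.thead.tr.th.get_text(strip=True)
    stats[team_name] = table_to_dict(table)

print(stats["Los Angeles Clippers (1-0)"]["B. Griffin"]["fgm-a"])
```

All values stay as strings here; converting entries such as "5-14" or "26:19" into numbers would be a separate cleanup step.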
Upvotes: 2
Reputation: 3593
Here is my solution. Note that I have a slightly different version of BeautifulSoup (not the one from bs4), but the logic should not be too far off. I am still on Python 2.7 (on Windows in my case).
You will likely need to handle some edge cases for player rows that differ from the ones you show above, but I think you'll be able to handle that part :-)
import urllib
import urllib2
# from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rows = [ x for x in table.findAll('tr') if x.get('class',None) in ['odd','even'] ]
        for player in rows:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_position, player_numbers
With bs4 (BeautifulSoup 4, as I learned), some modifications were needed. You still have to handle a few things, but this extracts most of the data you want:
import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rows = table.find_all(attrs={'class':'odd'})
        rows.extend(table.find_all(attrs={'class':'even'}))
        for player in rows:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_position, player_numbers
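One of the things left to handle is a row without a position cell: `find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError`. A possible guard is sketched below. It assumes the position cell is simply absent for some players; the second row in the markup is a made-up example of such a row, and `cell_text` is a helper name invented here.

```python
from bs4 import BeautifulSoup

# First row is from the question's sample; the second row is HYPOTHETICAL and
# only illustrates a player row that has no nbaGIPosition cell.
SAMPLE_HTML = """
<table id="nbaGITeamStats">
<tbody>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td>
<td class="nbaGIPosition">F</td><td>14:16</td>
</tr>
<tr class="even">
<td id="nbaGIBoxNme" class="b"><a href="#">Hypothetical Player</a></td>
<td>00:00</td>
</tr>
</tbody>
</table>
"""


def cell_text(row, default="", **criteria):
    """Return the text of the first matching td, or default if none matches."""
    cell = row.find("td", **criteria)
    if cell is None:
        return default
    return cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = []
for row in soup.find_all("tr", class_=["odd", "even"]):
    name = cell_text(row, id="nbaGIBoxNme")
    position = cell_text(row, default="n/a", class_="nbaGIPosition")
    results.append((name, position))

print(results)
```

With a guard like this the position line would not need to be commented out, though the placeholder value ("n/a" here) is an arbitrary choice.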
Upvotes: 2