Reputation: 269
I want to scrape a couple of tables for various players. When the search returns multiple matches, as it does for Sergio Rodriguez (https://basketball.realgm.com/search?q=Sergio+Rodriguez), the script never reaches an individual player page and prints "No international table for Sergio Rodriguez." Of the three results, I want to open the page of the Sergio Rodriguez who played in the NBA, who is second in the list, and scrape his tables, but I'm not sure how to go about it. How do I use the rel attribute, since that seems to be the only way this would work? The pseudocode is below if that helps. Thanks.
The HTML:
<tbody>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td class="nowrap" rel="5">SF</td>
<td class="nowrap" rel="79">6-7</td>
<td class="nowrap" rel="202">202</td>
<td class="nowrap" rel="19931018"><a href="/info/birthdays/19931018/1">Oct 18, 1993</a></td>
<td class="nowrap" rel="2015"><a href="/nba/draft/past_drafts/2015" target="_blank">2015</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td class="nowrap" rel="1">PG</td>
<td class="nowrap" rel="75">6-3</td>
<td class="nowrap" rel="176">176</td>
<td class="nowrap" rel="19860612"><a href="/info/birthdays/19860612/1">Jun 12, 1986</a></td>
<td class="nowrap" rel="2006"><a href="/nba/draft/past_drafts/2006" target="_blank">2006</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="NYK, PHL, POR, SAC"><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a>, <a href="/nba/teams/Philadelphia-Sixers/22/Rosters/Regular/2017">PHL</a>, <a href="/nba/teams/Portland-Trail-Blazers/24/Rosters/Regular/2009">POR</a>, <a href="/nba/teams/Sacramento-Kings/25/Rosters/Regular/2010">SAC</a></td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td class="nowrap" rel="3">SG</td>
<td class="nowrap" rel="76">6-4</td>
<td class="nowrap" rel="-">-</td>
<td class="nowrap" rel="19771012"><a href="/info/birthdays/19771012/1">Oct 12, 1977</a></td>
<td class="nowrap" rel="1999"><a href="/nba/draft/past_drafts/1999" target="_blank">1999</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
</tbody>
import requests
from bs4 import BeautifulSoup
import pandas as pd

playernames=['Carlos Delfino', 'Sergio Rodriguez']
result = pd.DataFrame()
for name in playernames:
    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # check the response url
    if (response.url == "https://basketball.realgm.com/search..."):
        # parse the search results, finding players who played in NBA
        ... get urls from the table ...
        soup.table... # etc.
        foreach url in table:
            response = requests.get(player_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # call the parse function for a player page
            ...
            parse_player(soup)
    else: # we have a player page
        # call the parse function for a player page, same as above
        ...
        parse_player(soup)
    try:
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')
        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]
        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name
    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])
Upvotes: 0
Views: 66
Reputation: 578
Pandas has a very useful method, read_html, for reading HTML directly. It is especially handy when you want information from tables, as you do here: pandas scans the page for tables and returns each one as a dataframe. See the pandas documentation for read_html for details.
The problem is that you also need the link to the player's page, and read_html reads each table as text and, by default, does not keep the tags.
Nevertheless, I found a possible solution. It is by no means the best one, but hopefully, you can use and improve it.
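The approach is:
1. Read the tables on the search page with the read_html method.
2. Keep the players who have played in the NBA (NBA != '-').
3. Handle duplicate names: there may be several players called Sergio Rodriguez, but if, say, only the 2nd one has played in the NBA, you need that index, i.e. index=1 (assuming the index starts at 0), to look up the link later.
4. Collect all links on the page whose text is Sergio Rodriguez.
5. Pick the link at that index among the links named Sergio Rodriguez.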
import pandas as pd
import requests
from bs4 import BeautifulSoup
# read the data from the website as a list of dataframes (tables)
web_data = pd.read_html('https://basketball.realgm.com/search?q=Sergio+Rodriguez')
# the table you need is the second to last one
required_table = web_data[len(web_data)-2]
print (required_table)
>>>
Player Pos HT WT Birth Date Draft Year College NBA
0 Sergio Rodriguez Febles SF 6-7 202 Oct 18, 1993 2015 - -
1 Sergio Rodriguez PG 6-3 176 Jun 12, 1986 2006 - NYK, PHL, POR, SAC
2 Sergio Rodriguez SG 6-4 - Oct 12, 1977 1999 - -
### get the player name who has played in NBA
required_player_name = required_table.loc[required_table['NBA']!='-']['Player'].values[0]
print (required_player_name)
>>>
Sergio Rodriguez
## check for duplicate players with this name (reset index so that we get the indices of player with the same name in order)
table_with_player = required_table.loc[(required_table['Player']==required_player_name)].reset_index(drop=True)
# get the indices of player where NBA is not '-'
index_of_player_to_get = list(table_with_player[table_with_player['NBA']!='-'].index)[0]
print (index_of_player_to_get)
### basically, if index_of_player_to_get = 2 (say), then we need the 3rd link whose text == required_player_name
>>>
0
Now we can read in all the links and pull out the one at position index_of_player_to_get among all links with the name Sergio Rodriguez:
url='https://basketball.realgm.com/search?q=Sergio+Rodriguez'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
## get all links
all_links = soup.find_all('a', href=True)
link_idx = -1
for link in all_links:
    if link.text == required_player_name:
        # player name found, increment link_idx
        link_idx += 1
        if link_idx == index_of_player_to_get:
            print(link['href'])
>>>
/player/Sergio-Rodriguez/Summary/85
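The href printed above is relative, so it has to be joined with the site root before it can be requested. A minimal sketch, assuming the search URL from the question is used as the base; the variable names search_url, relative_href and player_url are illustrative and not part of the original code:

```python
from urllib.parse import urljoin

# Assumption: the search page URL from the question serves as the base.
search_url = "https://basketball.realgm.com/search?q=Sergio+Rodriguez"
# Relative link as printed by the loop above.
relative_href = "/player/Sergio-Rodriguez/Summary/85"

# urljoin keeps the scheme and host of the base and swaps in the new path.
player_url = urljoin(search_url, relative_href)
print(player_url)
```

The resulting player_url can then be passed to requests.get in the same way as the search URL.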
Upvotes: 1
Reputation: 6132
Since you know the rel you care about is always in the eighth column of the table, you can do something like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the markup from the question
rows = soup.find_all('tr')  # get each row from the table
eighth_text = [row.find_all('td')[7].text for row in rows]  # text of the eighth column
idx = [n for n, text in enumerate(eighth_text) if text != '-']  # indices of rows with text (NBA players)
Then you can access that (or those) player(s) with something like:
for i in idx:
    print(rows[i].a)
Or whatever attribute you're looking for. There are probably more Pythonic ways, but I prioritized ease of understanding.
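Building on the eighth-column idea, the following sketch collects the player-page href for every row whose NBA cell is not '-' in a single pass. It runs against a trimmed copy of the rows from the question (cell attributes and two of the team links are dropped for brevity), and the name nba_player_links is made up for illustration. The length check is an assumption to guard against header rows or other tables on the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the search-result rows from the question (attributes dropped).
html = """
<table><tbody>
<tr><td><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td>SF</td><td>6-7</td><td>202</td><td>Oct 18, 1993</td><td>2015</td><td>-</td><td>-</td></tr>
<tr><td><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td>PG</td><td>6-3</td><td>176</td><td>Jun 12, 1986</td><td>2006</td><td>-</td>
<td><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a>, <a href="/nba/teams/Portland-Trail-Blazers/24/Rosters/Regular/2009">POR</a></td></tr>
<tr><td><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td>SG</td><td>6-4</td><td>-</td><td>Oct 12, 1977</td><td>1999</td><td>-</td><td>-</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

nba_player_links = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # Skip header rows or anything without the expected eight columns.
    if len(cells) < 8:
        continue
    # The eighth cell lists NBA teams; "-" means no NBA team is listed.
    if cells[7].get_text(strip=True) != "-":
        # The first cell holds the link to the player's own page.
        nba_player_links.append(cells[0].a["href"])

print(nba_player_links)
```

Reading the link and the NBA cell from the same row avoids matching players by name, so duplicate names need no special handling.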
Upvotes: 0