J. Doe

Reputation: 269

Getting link and scraping table from list

I want to get a couple of tables for various players. When the script searches for someone like Sergio Rodriguez, multiple names come up (https://basketball.realgm.com/search?q=Sergio+Rodriguez), so instead of going to the individual page it spits out "No international table for Sergio Rodriguez." Out of the three results, I want to go into the individual page of the Sergio Rodriguez that played in the NBA, who is second in the list, and scrape his tables, but I'm not sure how to go about it. How do I use the rel attribute, since that seems like the only way this would work? The pseudocode below shows the rough idea, if that helps. Thanks.

The HTML:

<tbody>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td class="nowrap" rel="5">SF</td>
<td class="nowrap" rel="79">6-7</td>
<td class="nowrap" rel="202">202</td>
<td class="nowrap" rel="19931018"><a href="/info/birthdays/19931018/1">Oct 18, 1993</a></td>
<td class="nowrap" rel="2015"><a href="/nba/draft/past_drafts/2015" target="_blank">2015</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td class="nowrap" rel="1">PG</td>
<td class="nowrap" rel="75">6-3</td>
<td class="nowrap" rel="176">176</td>
<td class="nowrap" rel="19860612"><a href="/info/birthdays/19860612/1">Jun 12, 1986</a></td>
<td class="nowrap" rel="2006"><a href="/nba/draft/past_drafts/2006" target="_blank">2006</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="NYK, PHL, POR, SAC"><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a>, <a href="/nba/teams/Philadelphia-Sixers/22/Rosters/Regular/2017">PHL</a>, <a href="/nba/teams/Portland-Trail-Blazers/24/Rosters/Regular/2009">POR</a>, <a href="/nba/teams/Sacramento-Kings/25/Rosters/Regular/2010">SAC</a></td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td class="nowrap" rel="3">SG</td>
<td class="nowrap" rel="76">6-4</td>
<td class="nowrap" rel="-">-</td>
<td class="nowrap" rel="19771012"><a href="/info/birthdays/19771012/1">Oct 12, 1977</a></td>
<td class="nowrap" rel="1999"><a href="/nba/draft/past_drafts/1999" target="_blank">1999</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
</tbody>
import requests
from bs4 import BeautifulSoup
import pandas as pd


playernames=['Carlos Delfino', 'Sergio Rodriguez']

result = pd.DataFrame()
for name in playernames:

    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')

    # check the response url
    if (response.url == "https://basketball.realgm.com/search..."):
        # parse the search results, finding players who played in NBA
        ... get urls from the table ...
        soup.table...  # etc.
        foreach url in table:
            response = requests.get(player_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # call the parse function for a player page
            ...
            parse_player(soup)
    else: # we have a player page
        # call the parse function for a player page, same as above
        ...
        parse_player(soup)

    try:
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')

        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]

        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name

    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])

Upvotes: 0

Views: 66

Answers (2)

Shaunak Sen

Reputation: 578

Pandas has a very useful method, read_html, that reads HTML directly. This is especially handy when you want data out of tables, which is exactly your case: pandas will scrape the page for any tables and return each table as a dataframe. Read more about it here.

The problem here is that you also need the link to each player's page, and the read_html method only reads the table cells as text; it does not keep the tags, so the href is lost.
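To illustrate (a minimal sketch; that the results table is the second to last one on the page is an assumption taken from the code further down):

import pandas as pd

# read_html fetches the page and returns a list of DataFrames, one for each table it finds
tables = pd.read_html('https://basketball.realgm.com/search?q=Sergio+Rodriguez')

# the Player column comes back as plain strings; the <a href="/player/..."> tags are dropped
print(tables[-2]['Player'].tolist())
# expected output, something like: ['Sergio Rodriguez Febles', 'Sergio Rodriguez', 'Sergio Rodriguez']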

Nevertheless, I found a possible solution. It is by no means the best one, but hopefully, you can use and improve it.

The approach is:

  1. Read the table using read_html method
  2. Get the required player name from the table (the player with NBA != '-')
  3. There might be multiple players with this name - say there are three players named Sergio Rodriguez, but only the 2nd one has played in the NBA - you would need this index, i.e. index=1 (assuming the starting index is 0), to look up the link later
  4. To get the index, we query the table for the player name and get the index location of that player.
  5. Now we search through all links whose text is Sergio Rodriguez
  6. We pick out only the link with the matching index, i.e. if the index is 1 (starting from 0) we pick out the 2nd link with text == Sergio Rodriguez

import pandas as pd
import requests
from bs4 import BeautifulSoup

# read the data from the website as a list of dataframes (tables)
web_data = pd.read_html('https://basketball.realgm.com/search?q=Sergio+Rodriguez')

# the table you need is the second to last one on the page
required_table = web_data[-2]

print (required_table)
>>>
                    Player Pos   HT   WT    Birth Date  Draft Year College                 NBA
0  Sergio Rodriguez Febles  SF  6-7  202  Oct 18, 1993        2015       -                   -
1         Sergio Rodriguez  PG  6-3  176  Jun 12, 1986        2006       -  NYK, PHL, POR, SAC
2         Sergio Rodriguez  SG  6-4    -  Oct 12, 1977        1999       -                   -
### get the name of the player who has played in the NBA
required_player_name = required_table.loc[required_table['NBA']!='-']['Player'].values[0]

print (required_player_name)
>>>
Sergio Rodriguez
## check for duplicate players with this name (reset the index so that we get the indices of players with the same name in order)

table_with_player = required_table.loc[(required_table['Player']==required_player_name)].reset_index(drop=True)

# get the index of the player whose NBA entry is not '-'
index_of_player_to_get = list(table_with_player[table_with_player['NBA']!='-'].index)[0]

print (index_of_player_to_get)


### basically, if index_of_player_to_get is 2 (say), then we need the 3rd link with player name == required_player_name
>>>
0

Now we can read in all the links and pull out the one at position index_of_player_to_get among all links whose text is Sergio Rodriguez.


url='https://basketball.realgm.com/search?q=Sergio+Rodriguez'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

## get all links
all_links = soup.find_all('a', href=True)

link_idx = -1
for link in all_links:
    if link.text == required_player_name:
        # player name found, inc link_idx
        link_idx+=1
        if link_idx == index_of_player_to_get:
            print (link['href'])
>>>
/player/Sergio-Rodriguez/Summary/85

Upvotes: 1

Juan C

Reputation: 6132

So, since your rel is always in the eighth column of the table, you can do something like this:

soup = BeautifulSoup(html, 'html.parser')  # html holds the search-results page source

rows = [row for row in soup.find_all('tr')] # Get each row from the table

eighth_text = [row.find_all('td')[7].text for row in rows] # get the text of the eighth column in each row
idx = [n for n, i in enumerate(eighth_text) if i != '-'] # get the indices of all rows that have teams listed (NBA players)

Then you can access that (or those) player(s) with something like:

for i in idx:
    print(rows[i].a)

Or whatever attribute you're looking for. There are probably more pythonic ways, but I prioritize ease of understanding.
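For instance, one slightly more compact variant (just a sketch, reusing the soup from above and the same assumption that the NBA teams sit in the eighth column) pulls the player link straight out of each qualifying row:

# hrefs of the first link in every row whose eighth column is not '-'
nba_links = [row.find('a')['href']
             for row in soup.find_all('tr')
             if len(row.find_all('td')) > 7 and row.find_all('td')[7].text != '-']
print(nba_links)  # e.g. ['/player/Sergio-Rodriguez/Summary/85']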

Upvotes: 0
