Will Holt

Reputation: 11

Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape Nick Saban's Sports Reference page so that I can pull the list of All-Americans he coached and his bowl win-loss percentage.

I am new to Python, so this has been a massive struggle. When I inspect the page, I see a div with id="leaderboard_all-americans" and class="data_grid_box".

When I run the code below, I get the Coaching Record table, which is the first table on the page. I tried different indexes, thinking that might give me a different result, but that did not work either.

Ultimately, I want to get the All-American data and turn it into a data frame.

import requests
import bs4
import pandas as pd

saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))


Upvotes: 1

Views: 908

Answers (1)

Ajax1234

Reputation: 71461

sports-reference.com stores these HTML tables as comments in the raw response, which is why a normal tag lookup never finds them. You first have to grab the commented block that contains the All-Americans and bowl results, and then parse that block as HTML:

import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd
# Parse the raw page; at this point the leaderboards are still hidden inside HTML comments
d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# Pick the comment that contains the All-Americans leaderboard and re-parse it as HTML
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
b = soup(str(block), 'html.parser')
# One table row per player: the link text is the name, the trailing " (years)" text is the year
players = b.select('#leaderboard_all-americans table.no_columns tr')
p_results = [{'name':i.td.a.text, 'year':i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
# The bowl win-loss percentage is read from the same commented block
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)

Output:

all_americans:

                  name       year
0       Jonathan Allen       2016
1        Javier Arenas       2009
2          Mark Barron       2011
3     Antoine Caldwell       2008
4    Ha Ha Clinton-Dix       2013
5        Terrence Cody  2008-2009
6       Landon Collins       2014
7         Amari Cooper       2014
8     Landon Dickerson       2020
9   Minkah Fitzpatrick  2016-2017
10       Reuben Foster       2016
11        Najee Harris       2020
12       Derrick Henry       2015
13    Dont'a Hightower       2011
14         Mark Ingram       2009
15         Jerry Jeudy       2018
16        Mike Johnson       2009
17       Barrett Jones  2011-2012
18           Mac Jones       2020
19          Ryan Kelly       2015
20     Cyrus Kouandjio       2013
21       Chad Lavalais       2003
22    Alex Leatherwood       2020
23     Rolando McClain       2009
24   Demarcus Milliner       2012
25         C.J. Mosley  2012-2013
26      Reggie Ragland       2015
27           Josh Reed       2001
28    Trent Richardson       2011
29    A'Shawn Robinson       2015
30        Cam Robinson       2016
31         Andre Smith       2008
32       DeVonta Smith       2020
33       Marcus Spears       2004
34  Patrick Surtain II       2020
35      Tua Tagovailoa       2018
36    Deionte Thompson       2018
37      Chance Warmack       2012
38       Ben Wilkerson       2004
39      Jonah Williams       2018
40    Quinnen Williams       2018

bowl_win_loss:

' .63 (#23)'
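If you want to try the comment technique without hitting the site, here is a minimal offline sketch. It assumes a made-up page (`SAMPLE_HTML`) that mimics the layout described above, and the helper names `find_commented_block`, `extract_players`, and `parse_value_and_rank` are hypothetical, not part of Beautiful Soup or the site. It shows that a direct lookup on the page finds nothing, while re-parsing the comment does, and it also splits a string shaped like `' .63 (#23)'` into a number and a rank.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a page that hides a leaderboard inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<table id="coaching_record"><tr><td>visible table</td></tr></table>
<div id="all_leaderboard">
<!--
<div id="leaderboard_all-americans" class="data_grid_box">
  <table class="no_columns">
    <tr><td><a href="/p1.html">Player One</a> (2009)</td></tr>
    <tr><td><a href="/p2.html">Player Two</a> (2011-2012)</td></tr>
  </table>
</div>
-->
</div>
</body></html>
"""


def find_commented_block(html, element_id):
    """Return a soup built from the first comment mentioning element_id, or None."""
    page = BeautifulSoup(html, "html.parser")
    marker = 'id="{}"'.format(element_id)
    for comment in page.find_all(string=lambda text: isinstance(text, Comment)):
        if marker in comment:
            # The comment body is plain text, so it has to be parsed a second time.
            return BeautifulSoup(str(comment), "html.parser")
    return None


def extract_players(block):
    """Turn each leaderboard row into a {'name', 'year'} dict."""
    rows = block.select("#leaderboard_all-americans table.no_columns tr")
    return [
        {
            "name": row.td.a.text,
            # The trailing text node looks like " (2009)"; drop spaces and parentheses.
            "year": row.td.contents[-1].strip().strip("()"),
        }
        for row in rows
    ]


def parse_value_and_rank(raw):
    """Split a string shaped like ' .63 (#23)' into (0.63, 23)."""
    match = re.match(r"\s*([0-9]*\.?[0-9]+)\s*\(#(\d+)\)", raw)
    if match is None:
        raise ValueError("unexpected format: {!r}".format(raw))
    return float(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # A direct lookup fails because the div only exists inside a comment.
    print(BeautifulSoup(SAMPLE_HTML, "html.parser").find(id="leaderboard_all-americans"))
    block = find_commented_block(SAMPLE_HTML, "leaderboard_all-americans")
    print(extract_players(block))
    print(parse_value_and_rank(" .63 (#23)"))
```

The list returned by `extract_players` can be passed straight to `pd.DataFrame(...)`, as in the code above. Returning `None` when no comment matches is just one design choice; it avoids the `IndexError` that `[0]` on an empty list would raise if the page layout changes.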

Upvotes: 1
