Reputation: 35
I am looking to a data science project where I will be able to sum up the fantasy football points by the college the players went to (e.g. Alabama has 56 active players in the NFL so I will go through a database and add up all of their fantasy points to compare with other schools).
I was looking at the website: https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3
and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.
However, I am having trouble figuring out how to extract the players' college alma mater. To do so, I would have to:
Any suggestions here?
Upvotes: 0
Views: 550
Reputation: 28630
I agree, API are the way to go if they are there. My second "go to" is pandas
' .read_html()
(which uses BeautifulSoup under the hood to parse <table>
tags. Here's an alternate solution using ESPNs api to get team roster links, then use pandas to pull the table from each link. Saves you the trouble of having to iterate througheach player to get the college (I whish they just had an api that returned all players. nfl.com USED to have that, but is no longer publicly available, that I know of).
Code:
import requests
import pandas as pd
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/athletes/101'
all_teams = []
roster_links = []
for i in range(1,35):
url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
jsonData = requests.get(url).json()
print (jsonData['team']['displayName'])
for link in jsonData['team']['links']:
if link['text'] == 'Roster':
roster_links.append(link['href'])
break
for link in roster_links:
print (link)
tables = pd.read_html(link)
df = pd.concat(tables).drop('Unnamed: 0',axis=1)
df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '')
df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
all_teams.append(df)
final_df = pd.concat(all_teams).reset_index(drop=True)
Output:
print (final_df)
Name POS Age HT WT Exp College Jersey
0 Matt Ryan QB 35 6' 4" 217 lbs 13 Boston College 2
1 Matt Schaub QB 39 6' 6" 245 lbs 17 Virginia 8
2 Todd Gurley II RB 26 6' 1" 224 lbs 6 Georgia 21
3 Brian Hill RB 25 6' 1" 219 lbs 4 Wyoming 23
4 Qadree Ollison RB 24 6' 1" 232 lbs 2 Pittsburgh 30
... .. ... ... ... .. ... ...
1772 Jonathan Owens S 25 5' 11" 210 lbs 2 Missouri Western 36
1773 Justin Reid S 23 6' 1" 203 lbs 3 Stanford 20
1774 Ka'imi Fairbairn PK 26 6' 0" 183 lbs 5 UCLA 7
1775 Bryan Anger P 32 6' 3" 205 lbs 9 California 9
1776 Jon Weeks LS 34 5' 10" 242 lbs 11 Baylor 46
[1777 rows x 8 columns]
Upvotes: 2
Reputation: 10809
There's no need for Selenium, or other headless, automated browsers. That's overkill.
If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read
If the POST request is well-formed, the API responds with JSON, containing information about every single player. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately, the college information isn't part of the JSON response. However, there is a field PlayerUrlString
, which is a relative-URL to a given player's profile page, which does contain the college name. So:
For each player in the response JSON:
Code:
def main():
import requests
from bs4 import BeautifulSoup
url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"
data = {
"sort": "FantasyPoints-desc",
"pageSize": "50",
"filters.season": "2020",
"filters.seasontype": "1",
"filters.scope": "1",
"filters.subscope": "1",
"filters.aggregatescope": "1",
"filters.range": "3",
}
response = requests.post(url, data=data)
response.raise_for_status()
players = response.json()["Data"]
for player in players:
url = "https://fantasydata.com" + player["PlayerUrlString"]
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
college = soup.find("dl", {"class": "dl-horizontal"}).findAll("dd")[-1].text.strip()
print(player["Name"] + " went to " + college)
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
You can also edit the pageSize
POST parameter in the data
dictionary. The 50
corresponds to information about the first 50 players in the JSON response (according to the filters set by the other POST parameters). Changing this value will yield more or less players in the JSON response.
Upvotes: 2