FujiApple21

Reputation: 35

Python Webscraping Approach for Comparing Football Players' college alma maters with total NFL Fantasy Football output

I am looking to do a data science project in which I sum up fantasy football points by the college the players attended (e.g. Alabama has 56 active players in the NFL, so I would go through a database and add up all of their fantasy points to compare with other schools).

I was looking at the website: https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3

and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.

However, I am having trouble figuring out how to extract the players' college alma mater. To do so, I would have to:

Any suggestions here?

Upvotes: 0

Views: 550

Answers (2)

chitown88

Reputation: 28630

I agree, APIs are the way to go if they are available. My second "go to" is pandas' .read_html(), which uses lxml or BeautifulSoup under the hood to parse <table> tags. Here's an alternate solution that uses ESPN's API to get team roster links, then uses pandas to pull the table from each link. It saves you the trouble of iterating through each player to get the college. (I wish they just had an API that returned all players. nfl.com used to have one, but it is no longer publicly available, as far as I know.)

Code:

import requests
import pandas as pd

all_teams = []
roster_links = []

# Collect the roster page link for each team ID from ESPN's API
for i in range(1, 35):
    url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
    jsonData = requests.get(url).json()
    print(jsonData['team']['displayName'])
    for link in jsonData['team']['links']:
        if link['text'] == 'Roster':
            roster_links.append(link['href'])
            break

# Parse the roster tables on each team page into one DataFrame per team
for link in roster_links:
    print(link)
    tables = pd.read_html(link)
    df = pd.concat(tables).drop('Unnamed: 0', axis=1)
    # Name holds the player name followed by the jersey number; split them apart.
    # regex=True is required on pandas 2.0+, where str.replace is literal by default.
    df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '', regex=True)
    df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
    all_teams.append(df)

final_df = pd.concat(all_teams).reset_index(drop=True)

Output:

print (final_df)
                  Name POS  Age      HT       WT Exp           College Jersey
0            Matt Ryan  QB   35   6' 4"  217 lbs  13    Boston College      2
1          Matt Schaub  QB   39   6' 6"  245 lbs  17          Virginia      8
2       Todd Gurley II  RB   26   6' 1"  224 lbs   6           Georgia     21
3           Brian Hill  RB   25   6' 1"  219 lbs   4           Wyoming     23
4       Qadree Ollison  RB   24   6' 1"  232 lbs   2        Pittsburgh     30
               ...  ..  ...     ...      ...  ..               ...    ...
1772    Jonathan Owens   S   25  5' 11"  210 lbs   2  Missouri Western     36
1773       Justin Reid   S   23   6' 1"  203 lbs   3          Stanford     20
1774  Ka'imi Fairbairn  PK   26   6' 0"  183 lbs   5              UCLA      7
1775       Bryan Anger   P   32   6' 3"  205 lbs   9        California      9
1776         Jon Weeks  LS   34  5' 10"  242 lbs  11            Baylor     46

[1777 rows x 8 columns]
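
To get from this roster table to per-college totals, one option is to merge it with a table of fantasy points on the player name and then group by College. This is only a minimal sketch: `roster_df` is a tiny stand-in for final_df, and `points_df` with its `FantasyPoints` column is hypothetical (the values are invented for illustration, not real statistics).

```python
import pandas as pd

# Tiny stand-in for final_df, keeping only the columns needed here.
roster_df = pd.DataFrame({
    "Name": ["Matt Ryan", "Todd Gurley II", "Brian Hill", "Josh Allen", "Justin Reid"],
    "College": ["Boston College", "Georgia", "Wyoming", "Wyoming", "Stanford"],
})

# Hypothetical fantasy points table; the numbers are made up.
points_df = pd.DataFrame({
    "Name": ["Matt Ryan", "Todd Gurley II", "Brian Hill", "Josh Allen"],
    "FantasyPoints": [250.0, 150.0, 60.0, 390.0],
})

# Keep only players present in both tables, then sum points per college.
merged = roster_df.merge(points_df, on="Name", how="inner")
college_totals = (
    merged.groupby("College")["FantasyPoints"]
    .sum()
    .sort_values(ascending=False)
)
print(college_totals)
```

Joining on the name alone is fragile: two sources may spell a name differently (suffixes such as "II", for example), and two players can share a name, so a team or position column is probably worth adding to the join key.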

Upvotes: 2

Paul M.

Reputation: 10809

There's no need for Selenium, or other headless, automated browsers. That's overkill.

If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read

If the POST request is well-formed, the API responds with JSON containing information about every player that matches the filters. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately the college isn't part of the JSON response. However, there is a field, PlayerUrlString, which is a relative URL to a given player's profile page, and that page does contain the college name. So:

  • Make a POST request to the API to get information about all players

For each player in the response JSON:

  • Visit that player's profile
  • Use BeautifulSoup to extract the college name from the current player's profile

Code:

import sys

import requests
from bs4 import BeautifulSoup


def main():
    url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"

    # Form data mirroring the query-string filters of the leaders page
    data = {
        "sort": "FantasyPoints-desc",
        "pageSize": "50",
        "filters.season": "2020",
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }

    response = requests.post(url, data=data)
    response.raise_for_status()

    players = response.json()["Data"]
    for player in players:
        # PlayerUrlString is relative, so prepend the site root
        profile_url = "https://fantasydata.com" + player["PlayerUrlString"]

        profile = requests.get(profile_url)
        profile.raise_for_status()

        soup = BeautifulSoup(profile.content, "html.parser")

        # The college is the last <dd> in the profile's definition list
        college = soup.find("dl", {"class": "dl-horizontal"}).find_all("dd")[-1].text.strip()

        print(player["Name"] + " went to " + college)

    return 0


if __name__ == "__main__":
    sys.exit(main())

Output:

Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
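
To go from these printed lines to per-college totals, the loop could accumulate points instead of printing. A rough sketch using invented sample records; the helper name `total_points_by_college` is hypothetical, and reading the points from something like `player["FantasyPoints"]` is an assumption about the JSON response rather than a confirmed field.

```python
from collections import defaultdict


def total_points_by_college(records):
    """Sum fantasy points per college from (college, points) pairs."""
    totals = defaultdict(float)
    for college, points in records:
        totals[college] += points
    return dict(totals)


# Invented values standing in for (college, points) gathered in the loop.
sample = [
    ("Texas Tech", 374.0),
    ("Oklahoma", 378.0),
    ("Wyoming", 395.0),
    ("Wyoming", 20.0),
]
totals = total_points_by_college(sample)

# Print colleges from highest to lowest total.
for college, points in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(college, points)
```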

You can also edit the pageSize POST parameter in the data dictionary. The value 50 means the JSON response contains the first 50 players (according to the filters set by the other POST parameters). Changing this value will yield more or fewer players in the JSON response.
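
If you expect to vary the page size, it may be convenient to build the form data in a small helper. This is just a sketch: `build_payload` is a hypothetical name, the other fields are copied from the dictionary above, and whether the endpoint honours arbitrarily large page sizes is an assumption I haven't checked.

```python
def build_payload(page_size=50, season=2020):
    """Return the POST form data with a configurable page size and season."""
    return {
        "sort": "FantasyPoints-desc",
        "pageSize": str(page_size),  # the original sends values as strings
        "filters.season": str(season),
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }


# Example: request up to 300 players instead of the default 50.
payload = build_payload(page_size=300)
print(payload["pageSize"])
```

The resulting dictionary can be passed as the data argument of requests.post in place of the hard-coded one.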

Upvotes: 2
