Reputation: 1

Learning how to web scrape ESPN NFL Schedule with Python

Currently, I am trying to use Python to scrape the ESPN website for the upcoming NFL game schedule for each week and store it in a dataframe. I'm unable to find a way to get the desired output. I am also super new to coding, Python, and everything in general. Could someone help me find a way to get from the current output to the desired output? The website I am scraping is below: https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2

I wanted to output a data frame with columns: away team, home team, game time, game location, and odds.

So far, using the following code, I was able to get the team names and put them into a dataframe. See below.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.espn.com/nfl/scoreboard/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }

# Send a GET request to the webpage with headers
response = requests.get(url, headers=headers)
src = (response.content)

soup = BeautifulSoup(response.content, 'html.parser')
# Find all the game containers
game_containers = soup.find_all('a',class_="AnchorLink" )
team_names = soup.find_all('div', class_='ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db')
# List to hold the team names
team_list = [team.text for team in team_names]
# Pair the team names into away and home teams
away_teams = team_list[::2]  # Every other team starting from the first
home_teams = team_list[1::2]  # Every other team starting from the second

# Create a DataFrame from the data
df = pd.DataFrame({
    'Away Team': away_teams,
    'Home Team': home_teams
})

# Print the DataFrame
print(df)

I'll explain below what I did and what I see when inspecting the HTML.

This is where I am stuck and my shallow knowledge limits me. I'm not sure how to write code to extract that information from this HTML. Any help or advice is appreciated. Thanks!

WHAT I SEE: Getting the time, location, and odds is tricky, and I need some help because I have no idea what to look for in the HTML on ESPN. From what I can tell, the body of the webpage that contains the whole schedule is: <div class="mt3"> Each game box section is then displayed by the <div><div class="ScheduleTables mb5 ScheduleTables--nfl ScheduleTables--football

When I dive deeper, the lines: <tbody class="Table__TBODY"><tr class="Table__TR Table__TR--sm Table__even" data-idx="0"> contain all the information I need.

Embedded under the Table__TR class is the following: <td class="colspan__col Table__TD"> <td class="date__col Table__TD"><a class="AnchorLink" tabindex="0" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td> <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td> <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink" tabindex="0" data-track-event_name="espn bet interaction" data-track-event_detail="espnbet:espn:nfl:schedule:pointSpread:KC -3" data-track-basemetrics="sport,league"

Upvotes: 0

Answers (2)

chitown88

Reputation: 28640

3 Ways to do it:

1 - Let pandas parse it:

from io import StringIO

import requests
import pandas as pd

url = 'https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    }

response = requests.get(url, headers=headers).text
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
dfs = pd.read_html(StringIO(response))
# read_html returns one dataframe per <table> on the page; stack them into one
df = pd.concat(dfs)

2 - Your method with bs4, which is far more complicated, so I'm not going to code it out. What you should do is iterate over each <tr> tag and store the value of each column from its <td> tags.

3 - Use an API - see my other answer
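
For anyone who does want to try option 2, here is a minimal sketch of the row-by-row idea, not a tested scraper for the live page. It assumes each game is a <tr class="Table__TR"> and that every <td> carries a class such as date__col, location__col or odds__col, as quoted in the question. The function name parse_schedule_rows and the sample HTML (including the odds text) are hypothetical and exist only to show the loop.

```python
from bs4 import BeautifulSoup


def parse_schedule_rows(html):
    """Return one dict per <tr>, keyed by the first CSS class of each <td>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: each game is one <tr> carrying the Table__TR class
    for tr in soup.find_all("tr", class_="Table__TR"):
        row = {}
        for td in tr.find_all("td"):
            classes = td.get("class") or ["unknown"]
            # e.g. "date__col", "location__col", "odds__col"
            row[classes[0]] = td.get_text(" ", strip=True)
        if row:
            rows.append(row)
    return rows


# Hypothetical sample modeled on the markup quoted in the question
sample_html = """
<table><tbody class="Table__TBODY">
<tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
  <td class="date__col Table__TD"><a class="AnchorLink" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td>
  <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td>
  <td class="odds__col Table__TD"><div class="Odds__Message">Line: KC -3</div></td>
</tr>
</tbody></table>
"""

rows = parse_schedule_rows(sample_html)
print(rows)
```

Passing the result to pd.DataFrame(rows) would give one column per cell class. On the real page you would call parse_schedule_rows(response.text) and then rename the columns you care about.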

Upvotes: 0

chitown88

Reputation: 28640

A better way is to get the data from APIs, as it's more robust (i.e., it's not reliant on the HTML structure: if ESPN changes their web design, your code breaks, but with the API the data will usually come in the same JSON form), and you get far more data if you want it:

import requests
import pandas as pd

# Function to check if an element is a list or dictionary
def is_list_or_dict(x):
    return isinstance(x, (list, dict))

def merge_data(data):
    game_df = pd.json_normalize(data)
    game_df = game_df.drop(['uid'], axis=1)

    team_df = pd.json_normalize(data,
                            record_path=['competitions', 'competitors'],
                            meta=['id'],
                            meta_prefix='game.')
    team_df = team_df.drop(['id', 'uid'], axis=1)
    
    odds = pd.json_normalize(data,
                            record_path=['competitions', 'odds'],
                            meta=['id'],
                            meta_prefix='game.')
    
    
    # Flag columns made up entirely of lists/dicts (applymap is deprecated in newer pandas)
    columns_to_keep = ~game_df.apply(lambda col: col.map(is_list_or_dict)).all(axis=0)

    # Filter the DataFrame to keep only the desired columns
    game_df = game_df.loc[:, columns_to_keep]
    
    df = pd.merge(game_df, team_df, how='outer', left_on=['id'], right_on=['game.id']).drop(['id'], axis=1)
    df = pd.merge(df, odds, how='outer', left_on=['game.id'], right_on=['game.id'])
    
    return df

    
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)  Chrome/58.0.3029.110 Safari/537.3"
    }


url = 'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024'
jsonData = requests.get(url, headers=headers).json()
calendar = jsonData['content']['calendar']

dfs = []
for each in calendar:
    seasontype = each['value']
    seasontypeLabel = each['label']
    weeks = each['entries']
    for eachWeek in weeks:
        weekNo = eachWeek['value']
        weekLabel = eachWeek['label']
    
        url = f'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024&seasontype={seasontype}&week={weekNo}'
        jsonData = requests.get(url, headers=headers).json()
        schedules = jsonData['content']['schedule']
        
        print(f'Acquiring {seasontypeLabel}: {weekLabel}')

        for k,v in schedules.items():
            games = v['games']
            
            df = merge_data(games)
            df['seasontype'] = seasontype
            df['seasontypeLabel'] = seasontypeLabel
            df['week'] = weekNo
            df['weekLabel'] = weekLabel
            
            dfs.append(df)



results = pd.concat(dfs)
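
To make the record_path / meta part of merge_data easier to follow, here is a small offline sketch of the same pd.json_normalize call on a made-up game. The sample dictionary is a heavily trimmed, hypothetical stand-in for one entry of v['games']. The keys homeAway and team.abbreviation are assumptions about the JSON shape, so check them against the real response before relying on them.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for one entry of v['games'];
# the keys under "competitors" are assumptions about ESPN's JSON.
games = [
    {
        "id": "401671789",
        "competitions": [
            {
                "competitors": [
                    {"homeAway": "home", "team": {"abbreviation": "KC"}},
                    {"homeAway": "away", "team": {"abbreviation": "BAL"}},
                ]
            }
        ],
    }
]

# One row per competitor, with the parent game's id carried along as "game.id"
team_df = pd.json_normalize(
    games,
    record_path=["competitions", "competitors"],
    meta=["id"],
    meta_prefix="game.",
)

# Reshape to one row per game with separate away/home columns
matchups = team_df.pivot(index="game.id", columns="homeAway", values="team.abbreviation")
print(matchups)
```

If the real data has these fields, the same pivot on results would get you close to the away team / home team layout asked for in the question. Time, location and odds would still need to be picked out of the other columns.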

Upvotes: 1
