Reputation: 1
I am trying to scrape the upcoming NFL game schedule for each week from the ESPN website using Python and store it in a dataframe, but I can't find a way to get the desired output. I am also very new to coding, Python, and everything in general. Could someone help me get from my current output to the desired output? The page I am scraping is: https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2
I want to output a dataframe with the columns: away team, home team, game time, game location, and odds.
So far, using the following code, I was able to get the team names and put them into a dataframe. See below.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.espn.com/nfl/scoreboard/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
# Send a GET request to the webpage with headers
response = requests.get(url, headers=headers)
src = (response.content)
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the game containers
game_containers = soup.find_all('a', class_="AnchorLink")
team_names = soup.find_all('div', class_='ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db')
# List to hold the team names
team_list = [team.text for team in team_names]
# Pair the team names into away and home teams
away_teams = team_list[::2] # Every other team starting from the first
home_teams = team_list[1::2] # Every other team starting from the second
# Create a DataFrame from the data
df = pd.DataFrame({
    'Away Team': away_teams,
    'Home Team': home_teams
})
# Print the DataFrame
print(df)
I'll explain below what I did and what I see when inspecting the HTML.
This is where I am stuck, and my limited knowledge holds me back. I'm not sure how to write code to extract that information from this HTML. Any help or advice is appreciated. Thanks!
<div class="mt3">
Each game box section is then displayed by
<div><div class="ScheduleTables mb5 ScheduleTables--nfl ScheduleTables--football
When I dig deeper, the lines:
<tbody class="Table__TBODY"><tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
contain all the information I need.
Nested under the Table__TR row is the following:
<td class="colspan__col Table__TD"> <td class="date__col Table__TD"><a class="AnchorLink" tabindex="0" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td> <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td> <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink" tabindex="0" data-track-event_name="espn bet interaction" data-track-event_detail="espnbet:espn:nfl:schedule:pointSpread:KC -3" data-track-basemetrics="sport,league"
Upvotes: 0
Views: 335
Reputation: 28640
3 Ways to do it:
1 - Let pandas parse it:
from io import StringIO

import requests
import pandas as pd

url = 'https://www.espn.com/nfl/schedule/_/week/1/year/2024/seasontype/2'
# Headers to make the request look like it's coming from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers).text
# Wrap the HTML in StringIO: passing a raw HTML string to read_html is deprecated in recent pandas
dfs = pd.read_html(StringIO(response))
# read_html returns one DataFrame per <table> found on the page; stack them
df = pd.concat(dfs)
2 - Your approach with bs4, which is far more involved, so I won't code it out. What you should do is iterate over each <tr> tag and store the value of each <td> column.
3 - Use an API - see my other solution
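For option 2, here is a minimal sketch of the row-by-row idea. It runs against a small inline HTML sample modeled on the markup quoted in the question, not against the live page; the sample (including the odds text "KC -3") and the helper names cell_text and parse_rows are made up for illustration, and the real page may have more columns or different class names.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of one schedule row, modeled on the
# markup quoted in the question. The odds link text is an assumption.
SAMPLE_HTML = """
<table>
  <tbody class="Table__TBODY">
    <tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
      <td class="date__col Table__TD"><a class="AnchorLink" href="/nfl/game/_/gameId/401671789/ravens-chiefs">8:20 PM</a></td>
      <td class="location__col Table__TD"><div>GEHA Field at Arrowhead Stadium, Kansas City, MO</div></td>
      <td class="odds__col Table__TD"><div class="Odds__Message"><a class="AnchorLink">KC -3</a></div></td>
    </tr>
  </tbody>
</table>
"""


def cell_text(row, css_class):
    """Return the stripped text of the first <td> carrying css_class, or None."""
    cell = row.find("td", class_=css_class)
    return cell.get_text(strip=True) if cell else None


def parse_rows(html):
    """Walk every table row and collect one dict per game."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tbody.Table__TBODY tr.Table__TR"):
        rows.append({
            "Game Time": cell_text(tr, "date__col"),
            "Game Location": cell_text(tr, "location__col"),
            "Odds": cell_text(tr, "odds__col"),
        })
    return rows


games = parse_rows(SAMPLE_HTML)
print(games)
```

Passing the resulting list of dicts to pd.DataFrame(games) would give one row per game. Returning None for a missing cell keeps the loop from crashing on rows that lack, say, an odds column.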
Upvotes: 0
Reputation: 28640
A better way is to get the data from an API, as it's more robust: it doesn't rely on the HTML structure. If ESPN changes its web design, scraping code breaks, whereas the API data will usually keep the same JSON form. You also get far more data if you want it:
import requests
import pandas as pd
# Function to check if an element is a list or dictionary
def is_list_or_dict(x):
    return isinstance(x, (list, dict))


def merge_data(data):
    # One row per game
    game_df = pd.json_normalize(data)
    game_df = game_df.drop(['uid'], axis=1)
    # One row per team, tagged with the parent game's id as 'game.id'
    team_df = pd.json_normalize(data,
                                record_path=['competitions', 'competitors'],
                                meta=['id'],
                                meta_prefix='game.')
    team_df = team_df.drop(['id', 'uid'], axis=1)
    # One row per odds entry, tagged with the parent game's id as 'game.id'
    odds = pd.json_normalize(data,
                             record_path=['competitions', 'odds'],
                             meta=['id'],
                             meta_prefix='game.')
    # Keep only the columns that are not made up entirely of lists/dicts.
    # Note: applymap is deprecated in pandas >= 2.1; DataFrame.map is the replacement.
    columns_to_keep = ~game_df.applymap(is_list_or_dict).all(axis=0)
    game_df = game_df.loc[:, columns_to_keep]
    df = pd.merge(game_df, team_df, how='outer', left_on=['id'], right_on=['game.id']).drop(['id'], axis=1)
    df = pd.merge(df, odds, how='outer', left_on=['game.id'], right_on=['game.id'])
    return df
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
url = 'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024'
jsonData = requests.get(url, headers=headers).json()
# The calendar lists every season type and the weeks within it
calendar = jsonData['content']['calendar']
dfs = []
for each in calendar:
    seasontype = each['value']
    seasontypeLabel = each['label']
    weeks = each['entries']
    for eachWeek in weeks:
        weekNo = eachWeek['value']
        weekLabel = eachWeek['label']
        url = f'https://cdn.espn.com/core/nfl/schedule?xhr=1&year=2024&seasontype={seasontype}&week={weekNo}'
        jsonData = requests.get(url, headers=headers).json()
        schedules = jsonData['content']['schedule']
        print(f'Acquiring {seasontypeLabel}: {weekLabel}')
        # The schedule is keyed by date; each value holds that day's games
        for k, v in schedules.items():
            games = v['games']
            df = merge_data(games)
            df['seasontype'] = seasontype
            df['seasontypeLabel'] = seasontypeLabel
            df['week'] = weekNo
            df['weekLabel'] = weekLabel
            dfs.append(df)
# Combine every week into a single DataFrame
results = pd.concat(dfs)
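To illustrate what the record_path and meta arguments in merge_data do, and how the per-team rows could be turned into the away/home columns asked for, here is a minimal sketch on made-up data. The sample record and the field names homeAway and team.displayName are assumptions for illustration, not a confirmed ESPN schema, so inspect the actual JSON before relying on them.

```python
import pandas as pd

# Hypothetical, heavily trimmed game record. The field names are assumptions
# chosen to mirror the record_path used in merge_data above.
sample_games = [
    {
        "id": "401671789",
        "date": "2024-09-06T00:20Z",
        "competitions": [
            {
                "competitors": [
                    {"homeAway": "home", "team": {"displayName": "Kansas City Chiefs"}},
                    {"homeAway": "away", "team": {"displayName": "Baltimore Ravens"}},
                ]
            }
        ],
    }
]

# One row per competitor; the parent game's id is carried along as 'game.id',
# and the nested 'team' dict is flattened into 'team.displayName'.
team_df = pd.json_normalize(
    sample_games,
    record_path=["competitions", "competitors"],
    meta=["id"],
    meta_prefix="game.",
)

# Pivot to one row per game with separate away/home columns.
wide = (
    team_df.pivot(index="game.id", columns="homeAway", values="team.displayName")
    .rename(columns={"away": "Away Team", "home": "Home Team"})
    .reset_index()
)
print(wide)
```

The same pivot could be applied to the team rows in results, provided the real data exposes an equivalent home/away indicator column.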
Upvotes: 1