Reputation: 123
I'm building an automated MLB schedule from a base URL and a loop through a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing is, for each team's page, the team name itself, which I'd like as a new column 'team_name'. I have a small sample of my goal at the end of this post.
Below, is what I have so far, and if you run this, the print out does exactly what I need for just one team.
import pandas as pd
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']
df = pd.DataFrame()
for team in (team_list):
new_url = url_base + team
df = df.append(pd.read_html(new_url)[1])
df['team_name'] = team
print(df[['team_name', 'Opponent']])
The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using
df['team_name'][a:b] = team
where a, b are the starting and ending rows on the dataframe for the index team; but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name, then merging with df later, but get duplication errors. On a larger scale, what I'm looking for is this:
team_name opponent
0 seattle-mariners new-york-yankees
1 seattle-mariners new-york-yankees
2 seattle-mariners boston-red-sox
3 seattle-mariners boston-red-sox
4 seattle-mariners san-diego-padres
5 seattle-mariners san-diego-padres
6 cincinatti-reds new-york-yankees
7 cincinatti-reds new-york-yankees
8 cincinatti-reds boston-red-sox
9 cincinatti-reds boston-red-sox
10 cincinatti-reds san-diego-padres
11 cincinatti-reds san-diego-padres
Upvotes: 0
Views: 1131
Reputation: 62503
The original code df['team_name'] = team
rewrites team_name
for the entire df
. The code below creates a placeholder, df_team
, where team_name
is updated and then df.append(df_team)
.
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']
for loop
df_list = list()
for team in (team_list):
new_url = url_base + team
df_team = pd.read_html(new_url)[1]
df_team['team_name'] = team
df_list.append(df_team)
df = pd.concat(df_list)
list comprehension
:df_list = [pd.read_html(url_base + team)[1].assign(team=team) for team in team_list]
df = pd.concat(df_list)
Upvotes: 4