sobrio35
sobrio35

Reputation: 123

Overwriting values in column created with Python for loop

I'm building an automated MLB schedule from a base URL and a loop through a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing is, for each team's page, the team name itself, which I'd like as a new column 'team_name'. I have a small sample of my goal at the end of this post.

Below, is what I have so far, and if you run this, the print out does exactly what I need for just one team.

import pandas as pd
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']

df = pd.DataFrame()
for team in (team_list):
    new_url = url_base + team
    df = df.append(pd.read_html(new_url)[1])
    df['team_name'] = team

print(df[['team_name', 'Opponent']]) 

The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using

df['team_name'][a:b] = team

where a, b are the starting and ending rows on the dataframe for the index team; but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name, then merging with df later, but get duplication errors. On a larger scale, what I'm looking for is this:

           team_name          opponent
0   seattle-mariners  new-york-yankees
1   seattle-mariners  new-york-yankees
2   seattle-mariners    boston-red-sox
3   seattle-mariners    boston-red-sox
4   seattle-mariners  san-diego-padres
5   seattle-mariners  san-diego-padres
6    cincinatti-reds  new-york-yankees
7    cincinatti-reds  new-york-yankees
8    cincinatti-reds    boston-red-sox
9    cincinatti-reds    boston-red-sox
10   cincinatti-reds  san-diego-padres
11   cincinatti-reds  san-diego-padres

Upvotes: 0

Views: 1131

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62503

The original code df['team_name'] = team rewrites team_name for the entire df. The code below creates a placeholder, df_team, where team_name is updated and then df.append(df_team).

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']

Option A: for loop

df_list = list()

for team in (team_list):
    new_url = url_base + team
    df_team = pd.read_html(new_url)[1]
    df_team['team_name'] = team
    df_list.append(df_team)

df = pd.concat(df_list)

Option B: list comprehension:

df_list = [pd.read_html(url_base + team)[1].assign(team=team) for team in team_list]

df = pd.concat(df_list)

df.head()

enter image description here

df.tail()

enter image description here

Upvotes: 4

Related Questions