Reputation: 359
I have data from a sporting event, and it is known that each home arena introduces a bias that I want to adjust for. I have already created a dictionary where the key is the home team (arena) and the value is the adjustment I want to make.
So for each row, I want to take the home team, look up its adjustment, and subtract that from the event_distance column. I have the following code, but I cannot get it working.
#Making the dictionary, this is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
    df_temp = df[df.home_team == team]
    average = round(df_temp.event_distance.mean(),2)
    adj_shot_dict[team] = average

def make_adjustment(df):
    team = df.home_team
    distance = df.event_distance
    adj_dist = distance - adj_shot_dict[team]
    return adj_dist

df['adj_dist'] = df['event_distance'].apply(make_adjustment)
Upvotes: 0
Views: 47
Reputation: 120391
IIUC, you already have the dict and simply want to subtract the adj_shot_dict value for each row's home_team from the event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
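For context, here is a minimal sketch of why the original call fails and how the two working options compare. The sample rows are made up for illustration; only the column names and `adj_shot_dict` come from the question. `Series.apply` passes one scalar at a time to the function, so there is no `home_team` to look up, whereas a row-wise `DataFrame.apply(..., axis=1)` or the vectorised `map` both see the team.

```python
import pandas as pd

# Hypothetical sample data; the column names come from the question.
df = pd.DataFrame({
    "home_team": ["team1", "team2", "team2", "team3"],
    "event_distance": [10.0, 20.0, 50.0, 60.0],
})

# Per-arena adjustment: mean event_distance per home team, rounded to 2 places.
adj_shot_dict = (
    df.groupby("home_team")["event_distance"].mean().round(2).to_dict()
)

# Row-wise version: with axis=1 the function receives a whole row,
# so both columns are available.
def make_adjustment(row):
    return row["event_distance"] - adj_shot_dict[row["home_team"]]

adj_rowwise = df.apply(make_adjustment, axis=1)

# Vectorised version: translate each team to its adjustment, then subtract.
adj_mapped = df["event_distance"] - df["home_team"].map(adj_shot_dict)

print(adj_mapped.tolist())
```

Both give the same values; the `map` version avoids calling a Python function once per row, which usually matters on larger frames.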
Old answer
Group by home_team, compute the average of event_distance, then subtract the result from event_distance:
df['adj_dist'] = df['event_distance'] \
    - df.groupby('home_team')['event_distance'] \
        .transform('mean').round(2)
# OR (group_keys=False keeps the original index so the result aligns with df)
df['adj_dist'] = df.groupby('home_team', group_keys=False)['event_distance'] \
    .apply(lambda x: x - round(x.mean(), 2))
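One practical difference between the dict approach and the groupby approach is worth a quick sketch. The data and the deliberately incomplete `partial_dict` below are hypothetical: `transform` always covers every group present in the frame, while `map` returns NaN for any team missing from the dictionary.

```python
import pandas as pd

# Hypothetical sample data; the column names come from the question.
df = pd.DataFrame({
    "home_team": ["team1", "team2", "team2", "team3"],
    "event_distance": [10.0, 20.0, 50.0, 60.0],
})

# transform('mean') returns a Series aligned with df, one group mean per row.
group_mean = df.groupby("home_team")["event_distance"].transform("mean").round(2)
adj_transform = df["event_distance"] - group_mean

# A dictionary that lacks team3 (an assumption, to show the failure mode).
partial_dict = {"team1": 10.0, "team2": 35.0}
adj_map = df["event_distance"] - df["home_team"].map(partial_dict)

print(adj_transform.tolist())
print(adj_map.isna().tolist())
```

If the adjustments come from somewhere other than the data itself, checking `adj_map.isna().any()` after the subtraction is a cheap way to catch unmapped teams.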
Performance
>>> len(df)
60000
>>> df.sample(5)
  home_team  event_distance
5     team3              60
4     team2              50
1     team2              20
1     team2              20
0     team1              10
def loop():
    teams = df.home_team.unique().tolist()
    adj_shot_dict = {}
    for team in teams:
        df_temp = df[df.home_team == team]
        average = round(df_temp.event_distance.mean(),2)
        adj_shot_dict[team] = average

def loop2():
    df.groupby('home_team')['event_distance'].transform('mean').round(2)
>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
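`%timeit` is an IPython magic, so outside a notebook the comparison can be approximated with the standard `timeit` module. The sketch below uses synthetic data (three teams, 60,000 rows, random distances), which is an assumption and not the frame used for the timings above, so the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the benchmark frame (assumed shape: 60,000 rows).
rng = np.random.default_rng(0)
n = 60_000
df = pd.DataFrame({
    "home_team": rng.choice(["team1", "team2", "team3"], size=n),
    "event_distance": rng.integers(0, 100, size=n).astype(float),
})

def loop():
    # Build the adjustment dict one team at a time, as in the question.
    adj_shot_dict = {}
    for team in df.home_team.unique().tolist():
        df_temp = df[df.home_team == team]
        adj_shot_dict[team] = round(df_temp.event_distance.mean(), 2)
    return adj_shot_dict

def loop2():
    # Compute the per-row group mean in a single groupby/transform call.
    return df.groupby("home_team")["event_distance"].transform("mean").round(2)

runs = 20
t_loop = timeit.timeit(loop, number=runs) / runs
t_loop2 = timeit.timeit(loop2, number=runs) / runs
print(f"loop:  {t_loop * 1e3:.2f} ms per call")
print(f"loop2: {t_loop2 * 1e3:.2f} ms per call")
```

With only three teams the gap may be small; the loop rescans the frame once per team, so it tends to fall further behind as the number of distinct teams grows.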
Upvotes: 2