Reputation: 23
I have a set of players (common_opps), the size of which changes over time, and I need to take values from a 3D dataframe (df_versus) and return their mean. I am calling this function many times and the execution time grows each time. It is OK for a small number of players, but it gets to the point where this loop iterates over more than 500 players and the wait becomes very long. So I was wondering if there is a way to improve this, for example by replacing the loop with something like lambda functions. I tried Numba, but I can't configure it properly.
def common_opponents(p1, p2):
    # Opponents that both p1 and p2 have faced; wrap in list() so numpy
    # builds a 1-D array of names instead of a 0-d object array.
    common_opps = np.array(list(s_opponents[p1].intersection(s_opponents[p2])))
    serve1, serve2, ace1, ace2, df1, df2, break1, break2 = 0, 0, 0, 0, 0, 0, 0, 0
    length = len(common_opps)
    if length == 0:
        return serve1, serve2, ace1, ace2, df1, df2, break1, break2
    for opponent in common_opps:
        serve1 += df_versus[p1][opponent]["serve_won"] / df_versus[p1][opponent]["serve_total"]
        serve2 += df_versus[p2][opponent]["serve_won"] / df_versus[p2][opponent]["serve_total"]
        ace1 += df_versus[p1][opponent]["ace"] / df_versus[p1][opponent]["serve_total"]
        ace2 += df_versus[p2][opponent]["ace"] / df_versus[p2][opponent]["serve_total"]
        df1 += df_versus[p1][opponent]["df"] / df_versus[p1][opponent]["serve_total"]
        df2 += df_versus[p2][opponent]["df"] / df_versus[p2][opponent]["serve_total"]
        break1 += df_versus[p1][opponent]["break_won"] / df_versus[p1][opponent]["break_total"]
        break2 += df_versus[p2][opponent]["break_won"] / df_versus[p2][opponent]["break_total"]
    # Average each accumulated ratio over the common opponents.
    return (serve1/length, serve2/length, ace1/length, ace2/length,
            df1/length, df2/length, break1/length, break2/length)
p1 and p2 are the names of the players as strings, like 'Roger Federer' and 'Rafael Nadal'; s_opponents1 and s_opponents2 are sets of player names; common_opps is also a set of names; and df_versus is a MultiIndex dataframe made with
versus_index = pd.MultiIndex.from_product([unique_player, ["serve_won", "serve_total", "ace", "df", "break_won", "break_total", "won", "lost"]])
df_versus = pd.DataFrame(0, index=versus_index, columns=unique_player)
It is filled with the proper values over time, and unique_player is a list of the unique players in the whole dataset. If Nadal and Federer had only three common opponents, that is the kind of dataframe I would be working with, with the zeros replaced by their actual stats.
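For illustration, here is a minimal sketch of such a dataframe with three made-up common opponents; the names and numbers below are placeholders, not real stats:

import pandas as pd

unique_player = ["Roger Federer", "Rafael Nadal",
                 "Novak Djokovic", "Andy Murray", "Stan Wawrinka"]
stats = ["serve_won", "serve_total", "ace", "df",
         "break_won", "break_total", "won", "lost"]

versus_index = pd.MultiIndex.from_product([unique_player, stats])
df_versus = pd.DataFrame(0, index=versus_index, columns=unique_player)

# Invented example fill: Federer's serve numbers against Djokovic.
df_versus.loc[("Novak Djokovic", "serve_won"), "Roger Federer"] = 120
df_versus.loc[("Novak Djokovic", "serve_total"), "Roger Federer"] = 180

# This matches the access pattern used in the loop above:
ratio = (df_versus["Roger Federer"]["Novak Djokovic"]["serve_won"]
         / df_versus["Roger Federer"]["Novak Djokovic"]["serve_total"])  # 0.666...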
Upvotes: 1
Views: 80
Reputation: 23
OK, so I figured out how to solve this optimization problem. Instead of the for loop, I added four new ratio entries per player to df_versus (['serve_ratio', 'ace_ratio', 'df_ratio', 'break_ratio']), then used the .apply method to apply np.mean, which is super fast, and returned the result as an array. I think this process is now about 100 times faster:
serve1, ace1, df1, break1, serve2, ace2, df2, break2 = (
    df_versus.loc[([p1, p2], ['serve_ratio', 'ace_ratio', 'df_ratio', 'break_ratio']),
                  common_opps].apply(np.mean, axis=1).to_numpy())
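The post does not show how the ratio entries were precomputed, so below is a minimal sketch of one possible way, assuming the ratios live in the second index level as the .loc call above suggests; the helper name ratio_block and the replace(0, np.nan) guard against division by zero are illustrative choices, not part of the original code:

import numpy as np
import pandas as pd

def ratio_block(df, num_stat, den_stat, name):
    # Cross-sections are player-by-opponent frames, one per stat.
    num = df.xs(num_stat, level=1)
    den = df.xs(den_stat, level=1).replace(0, np.nan)  # avoid division by zero
    block = num / den
    # Re-attach the new stat name as the second index level.
    block.index = pd.MultiIndex.from_product([block.index, [name]])
    return block

ratios = pd.concat([
    ratio_block(df_versus, "serve_won", "serve_total", "serve_ratio"),
    ratio_block(df_versus, "ace",       "serve_total", "ace_ratio"),
    ratio_block(df_versus, "df",        "serve_total", "df_ratio"),
    ratio_block(df_versus, "break_won", "break_total", "break_ratio"),
])
# Sort so the .loc slice over (player, ratio) pairs above stays fast.
df_versus = pd.concat([df_versus, ratios]).sort_index()

With the ratio entries precomputed once, the single .loc/.apply line above replaces the whole Python-level loop, which is where the speed-up comes from.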
Upvotes: 1