Reputation: 45
I have a Pandas dataframe, week1_plays
in the following format:
What I want to do is add a column week1_plays['distance_from_receiver']
such that for each row in the dataframe, we grab the keys of gameId, playId, frameId
and find the x and y position of the player with those keys and position == 'WR'
. Then I'll calculate the distance from the receiver with the following function:
def get_distance(rec_x, rec_y, def_x, def_y):
distance = np.sqrt( ((def_x - rec_x)**2) + ((def_y - rec_y)**2) )
return distance
For example using the sample provided, the row 0 input to the function would be
get_distance(91.35, 44.16, 88.89, 36.47)
The current solution I have is to use a lambda function on the dataframe as such:
week1_topReceivers['distance_from_receiver'] = week1_topReceivers.apply(lambda row: get_distance(week1_wr_position.loc[np.where((week1_topReceivers['playId'] == row['playId']) & (week1_topReceivers['frameId'] == row['frameId']) & (week1_topReceivers['gameId'] == row['frameId']))]['x'],
week1_topReceivers.loc[np.where((week1_topReceivers['playId'] == row['playId']) & (week1_topReceivers['frameId'] == row['frameId']) & (week1_topReceivers['gameId'] == row['frameId']))]['y'], row['x'], row['y']), axis = 1)
but querying the dataframe for the first two inputs takes a very long time with a large dataframe. I know there has to be a more optimal solution to this but my searches online aren't turning up any better options.
EDIT: Here is a larger sample and the expected output:
SAMPLE
x y o dir event position frameId team gameId playId playDirection route
88.89 36.47 105.63 66.66 None SS 1 home 2018090600 75 left NaN
91.35 44.16 290.45 16.86 None WR 1 away 2018090600 75 left HITCH
86.31 22.01 70.12 168.91 None FS 1 home 2018090600 75 left NaN
73.64 28.70 103.05 219.41 None FS 1 home 2018090600 75 left NaN
86.48 31.12 95.90 33.36 None MLB 1 home 2018090600 75 left NaN
82.67 20.53 81.14 174.57 None CB 1 home 2018090600 75 left NaN
84.00 43.49 108.23 110.32 None CB 1 home 2018090600 75 left NaN
85.63 26.59 87.69 38.80 None LB 1 home 2018090600 75 left NaN
88.89 36.47 105.63 68.49 None SS 2 home 2018090600 75 left NaN
91.37 44.17 290.45 29.61 None WR 2 away 2018090600 75 left HITCH
86.32 22.00 70.88 119.04 None FS 2 home 2018090600 75 left NaN
73.64 28.70 104.57 228.17 None FS 2 home 2018090600 75 left NaN
86.48 31.11 101.10 30.26 None MLB 2 home 2018090600 75 left NaN
82.68 20.53 82.24 147.46 None CB 2 home 2018090600 75 left NaN
84.02 43.49 107.33 106.73 None CB 2 home 2018090600 75 left NaN
85.64 26.61 87.69 37.51 None LB 2 home 2018090600 75 left NaN
88.88 36.47 107.02 57.53 None SS 3 home 2018090600 75 left NaN
91.37 44.17 290.45 32.20 None WR 3 away 2018090600 75 left HITCH
86.33 22.00 71.88 93.49 None FS 3 home 2018090600 75 left NaN
73.63 28.69 104.57 227.74 None FS 3 home 2018090600 75 left NaN
EXPECTED OUTPUT:
x y o dir event position frameId team gameId playId playDirection route distance_from_receiver
88.89 36.47 105.63 66.66 None SS 1 home 2018090600 75 left NaN 8.07
91.35 44.16 290.45 16.86 None WR 1 away 2018090600 75 left HITCH 0.00
86.31 22.01 70.12 168.91 None FS 1 home 2018090600 75 left NaN 22.72
73.64 28.70 103.05 219.41 None FS 1 home 2018090600 75 left NaN 23.51
86.48 31.12 95.90 33.36 None MLB 1 home 2018090600 75 left NaN 13.92
82.67 20.53 81.14 174.57 None CB 1 home 2018090600 75 left NaN 25.17
84.00 43.49 108.23 110.32 None CB 1 home 2018090600 75 left NaN 7.38
85.63 26.59 87.69 38.80 None LB 1 home 2018090600 75 left NaN 18.48
88.89 36.47 105.63 68.49 None SS 2 home 2018090600 75 left NaN 8.09
91.37 44.17 290.45 29.61 None WR 2 away 2018090600 75 left HITCH 0.00
86.32 22.00 70.88 119.04 None FS 2 home 2018090600 75 left NaN 22.74
73.64 28.70 104.57 228.17 None FS 2 home 2018090600 75 left NaN 23.53
86.48 31.11 101.10 30.26 None MLB 2 home 2018090600 75 left NaN 13.95
82.68 20.53 82.24 147.46 None CB 2 home 2018090600 75 left NaN 25.19
84.02 43.49 107.33 106.73 None CB 2 home 2018090600 75 left NaN 7.39
85.64 26.61 87.69 37.51 None LB 2 home 2018090600 75 left NaN 18.47
88.88 36.47 107.02 57.53 None SS 3 home 2018090600 75 left NaN 8.09
91.37 44.17 290.45 32.20 None WR 3 away 2018090600 75 left HITCH 0.00
86.33 22.00 71.88 93.49 None FS 3 home 2018090600 75 left NaN 22.74
73.63 28.69 104.57 227.74 None FS 3 home 2018090600 75 left NaN 23.54
Upvotes: 1
Views: 66
Reputation: 2714
You are looking for a merge
or join
operation. Try something like this:
df = pd.DataFrame({'gameId':[1,1,1,1,1,1],'playId':[1,1,1,1,1,1],
'frameId':[1,1,1,2,2,2], 'position':['A','B','WR','C','WR','D'],
'x':[87,56,45,34,45,67], 'y':[25,36,47,365,25,36]})
# create a table with just the wide receiver positions:
wr = df.loc[df.position=='WR'].drop(columns='position')
# merge the wide receiver x,y values into the original table based on the keys:
df = df.merge(wr, how='outer', on=['gameId', 'playId', 'frameId'], suffixes=['', '_wr'])
# apply your function to calculate the column (avoid using apply because it's super slow)
df['dist_from_wr'] = [get_distance(x, y, x_wr, y_wr) for x, y, x_wr, y_wr
in zip(df.x, df.y, df.x_wr, df.y_wr)]
Note as well, that you're lucky here because your function is already vectorized (which is not always the case) so you can actually do this even more efficiently by passing entire columns as input arguments as follows:
df['dist_from_wr'] = get_distance(df.x, df.y, df.x_wr, df.y_wr)
Result:
| gameId | playId | frameId | position | x | y | x_wr | y_wr | dist_from_wr |
|-------:|-------:|--------:|:---------|----:|----:|-----:|-----:|-------------:|
| 1 | 1 | 1 | A | 87 | 25 | 45 | 47 | 47.4131 |
| 1 | 1 | 1 | B | 56 | 36 | 45 | 47 | 15.5563 |
| 1 | 1 | 1 | WR | 45 | 47 | 45 | 47 | 0 |
| 1 | 1 | 2 | C | 34 | 365 | 45 | 25 | 340.178 |
| 1 | 1 | 2 | WR | 45 | 25 | 45 | 25 | 0 |
| 1 | 1 | 2 | D | 67 | 36 | 45 | 25 | 24.5967 |
Upvotes: 2