Reputation: 13
I'm trying to create a function which can look at previous rows in a DataFrame and sum them based on a set number of rows to look back over. Here I have used 3 but ideally I would like to scale it up to look back over more rows. My solution works but doesn't seem very efficient. The other criteria is each time it hits a new team the count must start again, so the first row for each new team is always 0, the data will be ordered in team order but if a solution is known for where the data isn't in team order this would be incredible. Is there a function in Pandas which could help with this?
So far I've tried the code below and tried googling the issue, the closest example I could find is: here! but this groups the index and I'm unsure how to apply this when the value has to keep resetting each time it hits a new team, as it wouldn't distinguish each time there is a new team.
np.random.seed(0)
data = {'team':['a','a','a','a','a','a','a','a','b','b',
'b','b','b','b','b','b','c','c','c','c','c','c','c','c'],
'teamPoints': np.random.randint(0,4,24)}
df = pd.DataFrame.from_dict(data)
df.reset_index(inplace=True)
def find_sum_last_3(x):
if x == 0:
return 0
elif x == 1:
return df['teamPoints'][x-1]
elif x == 2:
return df['teamPoints'][x-1] + df['teamPoints'][x-2]
elif df['team'][x] != df['team'][x-1]:
return 0
elif df['team'][x] != df['team'][x-2]:
return df['teamPoints'][x-1]
elif df['team'][x] != df['team'][x-3]:
return df['teamPoints'][x-1] + df['teamPoints'][x-2]
else:
return df['teamPoints'][x-1] + df['teamPoints'][x-2] +
df['teamPoints'][x-3]
df['team_form_3games'] = df['index'].apply(lambda x : find_sum_last_3(x))
The first part of the function addresses the edge cases where a sum of 3 isn't possible because there are less than 3 elements
The second part of the function addresses the problem of the 'team' changing. When the team changes the sum needs to start again, so each 'team' is considered seperately
The final part simply looks at the previous 3 elements of the dataFrame and sums them together.
This example works as expected and gives a new column with expected output as follows:
0, 0, 3, 4, 4, 4, 6, 9, 0, 1, 4, 5, 6, 3, 5, 5, 0, 0, 0, 2, 3, 5, 6, 8
1st element is 0 as it is edge case, 2nd is 0 because the sum of the first element is 0. 3rd is 3 as the sum of the 1st and 2nd elements are 3. 4th is the sum of 1st,2nd,3rd. 5th is sum of 2nd,3rd,4th. 6th is sum of 3rd,4th,5th
However when scaled up to 10 it is shown to be very inefficient which makes it difficult to scale up to 10 or 15. It is also inelegant and a new function needs to be written for each different length of sum.
Upvotes: 1
Views: 227
Reputation: 30930
I think you are looking for GroupBy.apply + rolling
:
r3=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(3).sum().shift())
r2=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(2).sum().shift())
r1=df.groupby('team')['teamPoints'].apply(lambda x: x.shift())
df['team_form_3games'] = r3.fillna(r2.fillna(r1).fillna(0))
print(df)
Output:
index team teamPoints team_form_3games
0 0 a 0 0.0
1 1 a 3 0.0
2 2 a 1 3.0
3 3 a 0 4.0
4 4 a 3 4.0
5 5 a 3 4.0
6 6 a 3 6.0
7 7 a 3 9.0
8 8 b 1 0.0
9 9 b 3 1.0
10 10 b 1 4.0
11 11 b 2 5.0
12 12 b 0 6.0
13 13 b 3 3.0
14 14 b 2 5.0
15 15 b 0 5.0
16 16 c 0 0.0
17 17 c 0 0.0
18 18 c 2 0.0
19 19 c 1 2.0
20 20 c 2 3.0
21 21 c 3 5.0
22 22 c 3 6.0
23 23 c 2 8.0
Upvotes: 1