Reputation: 373
I have a fairly specific algorithm I want to follow.
Basically I have a dataframe as follows:
month taken score
1 1 2 23
2 1 1 34
3 1 2 12
4 1 2 59
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 45
9 3 1 43
10 3 2 43
11 4 1 23
12 4 2 94
I want to make it so that the 'score' column is changed to 100 on days where taken == 2 continuously until the end of that month. So, not all occurrences of taken == 2 have their score set to 100, if any day following during that month has a taken == 1.
So the result I'd want is:
month taken score
1 1 2 23
2 1 1 34
3 1 2 100
4 1 2 100
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 100
9 3 1 43
10 3 2 43
11 3 1 23
12 3 2 100
13 4 1 32
14 4 2 100
I've written this code which I feel should do it:
#iterate through months
for month in range(12):
#iterate through scores
for score in range(len(df_report.loc[df_report['month'] == month+1])):
#starting from the bottom, of that month, if 'taken' == 2...
if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
#then set the score to 100
df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
#if you run into a 'taken' == 1, move on to next month
else: break
However, this doesn't appear to change any values, despite not throwing an error... it also doesn't give me an error about setting values to a copied dataframe.
Could anyone explain what I'm doing wrong?
Upvotes: 0
Views: 338
Reputation: 30605
You can do
import numpy as np
def get_value(x):
s = x['taken']
# Get a mask of duplicate sequeence and change values using np.where
mask = s.ne(s.shift()).cumsum().duplicated(keep=False)
news = np.where(mask,100,x['score'])
# if last number is 2 then change the news value to 100
if s[s.idxmax()] == 2: news[-1] = 100
return pd.Series(news)
df['score'] = df.groupby('month').apply(get_value).values
Output :
month taken score 1 1 2 23 2 1 1 34 3 1 2 100 4 1 2 100 5 2 1 12 6 2 2 23 7 2 1 43 8 2 2 100 9 3 1 43 10 3 2 100 11 4 1 23 12 4 2 100
Almost identical speed but @coldspeed is winner
ndf = pd.concat([df]*10000).reset_index(drop=True)
%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop
%%timeit
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop
Upvotes: 2
Reputation: 402263
The reason for your values not being updated is that assignment to iloc
updates the copy returned by the preceding loc
call, so the original is not touched.
Here's how I'd tackle this. First, define a function foo
.
def foo(df):
for i in reversed(df.index):
if df.loc[i, 'taken'] != 2:
break
df.loc[i, 'score'] = 100
i -= 1
return df
Now, groupby
month
and call foo
:
df = df.groupby('month').apply(foo)
print(df)
month taken score
1 1 2 23
2 1 1 34
3 1 2 100
4 1 2 100
5 2 1 12
6 2 2 23
7 2 1 43
8 2 2 100
9 3 1 43
10 3 2 100
11 4 1 23
12 4 2 100
Obviously, apply
has its shortcomings, but I cannot think of a vectorised approach to this problem.
Upvotes: 2