Reputation: 1647
I'm working with a pandas DataFrame similar to the one below.
  School  students
0      A        44
1      B        38
2      C        33
3      D        29
4      E        28
5      F        25
6      G        23
I've written a function that is meant to iterate through the rows and accumulate the number of students across schools until the running sum is greater than or equal to 75% of all students. The function should then return the DataFrame index at that point. (The column is already sorted.) My code below isn't working. Can you tell me what is wrong? The error message follows the code.
percentile = .75
def get_top(df,perc=percentile):
    thresh = perc*df['students'].sum()
    cum = 0
    for index, row in df.iterrows() :
        cum = cum + row['students']
        if cum >= thresh:
            return index-1
            break
output = df.apply(get_top)
KeyError: ('students', u'occurred at index School')
Upvotes: 0
Views: 969
Reputation: 1647
As Jarad indicated in the comment section, I needed to change the function call to:
output = get_top(df,perc=percentile)
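For context on why this works: `df.apply(get_top)` calls the function once per column and passes each column in as a Series, so `df['students']` inside the function is looked up on the `School` column's row labels and raises the `KeyError`. Calling the function on the whole DataFrame avoids that. A minimal sketch, assuming the sample data from the question (the DataFrame construction is reconstructed here for illustration, and the unreachable `break` is dropped):

```python
import pandas as pd

# Sample data reconstructed from the question (already sorted descending).
df = pd.DataFrame({
    "School": list("ABCDEFG"),
    "students": [44, 38, 33, 29, 28, 25, 23],
})

percentile = .75

def get_top(df, perc=percentile):
    # Threshold is 75% of the total number of students.
    thresh = perc * df["students"].sum()
    cum = 0
    for index, row in df.iterrows():
        cum = cum + row["students"]
        if cum >= thresh:
            # Same return value as in the question: the index just
            # before the row where the threshold is reached.
            return index - 1

# Call the function on the DataFrame directly, not through df.apply.
output = get_top(df, perc=percentile)
print(output)
```

With this data the threshold is 165.0, the running sum first reaches it at index 4, and the function as written returns 3.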
Upvotes: 0
Reputation: 862671
You can use numpy.where with cumsum:
import numpy as np

print (0.75*df['students'].sum())
165.0
print (df.students.cumsum())
0     44
1     82
2    115
3    144
4    172
5    197
6    220
Name: students, dtype: int64
df['out'] = np.where(df.students.cumsum() >= 0.75*df['students'].sum(),
                     df.index,
                     df.students.cumsum())
print (df)
  School  students  out
0      A        44   44
1      B        38   82
2      C        33  115
3      D        29  144
4      E        28    4
5      F        25    5
6      G        23    6
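If only the single index where the threshold is first reached is needed, rather than a new column, the same cumsum comparison can be reduced to one value. This is a sketch under the assumption of the question's sample data; the names `reached` and `first_idx` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "School": list("ABCDEFG"),
    "students": [44, 38, 33, 29, 28, 25, 23],
})

thresh = 0.75 * df["students"].sum()

# Boolean mask: True from the first row whose running total
# is at or above the threshold.
reached = df["students"].cumsum() >= thresh

# idxmax on a boolean Series returns the label of the first True.
first_idx = reached.idxmax()
print(first_idx)
```

Here `first_idx` is the row where the running total first reaches the threshold. The function in the question returns `index-1`, so subtract one to match it.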
Or, if you want the 75th percentile of the per-school values instead, use the quantile function:
print (df.students.quantile(.75))
35.5
df['out'] = np.where(df.students >= df.students.quantile(.75),
                     df.students.cumsum(),
                     df.index)
print (df)
  School  students  out
0      A        44   44
1      B        38   82
2      C        33    2
3      D        29    3
4      E        28    4
5      F        25    5
6      G        23    6
Upvotes: 1