Reputation: 1296
I want to apply a function to grouped data in a pandas DataFrame. I have the df below and currently loop over the groups as follows, but I think this should be done with pandas groupby.
import pandas as pd
import scipy
from scipy.stats import mstats
df = pd.DataFrame({'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900, 32778, 142333, 45267, 145211, 3429, 1258, 65439], 'Alternatives':['A','B','C']*5})
alternatives = list(set(df['Alternatives']))
df2 = pd.DataFrame()
for alternative in alternatives:
    alt = pd.DataFrame(df[(df.Alternatives == alternative)])
    alt = alt.sort_values(['cfs'])
    alt['rank'] = alt['cfs'].rank()
    alt['pp'] = 1 - scipy.stats.mstats.plotting_positions(alt['cfs'],0,0)
    df2 = pd.concat([df2, alt])  # DataFrame.append was removed in pandas 2.0
Output:
Alternatives cfs rank pp
12 A 3429 1.0 0.833333
6 A 56742 2.0 0.666667
9 A 142333 3.0 0.500000
0 A 147248 4.0 0.333333
3 A 176011 5.0 0.166667
5 C 18111 1.0 0.833333
8 C 32778 2.0 0.666667
14 C 65439 3.0 0.500000
2 C 81792 4.0 0.333333
11 C 145211 5.0 0.166667
13 B 1258 1.0 0.833333
10 B 45267 2.0 0.666667
1 B 94894 3.0 0.500000
7 B 154900 4.0 0.333333
4 B 208514 5.0 0.166667
I can get the rank by
df['rank'] = df['cfs'].groupby(df['Alternatives']).rank()
But I cannot get the plotting positions. The closest I have is:
group = df['cfs'].groupby(df['Alternatives']).apply(scipy.stats.mstats.plotting_positions,0,0 )
This gives me a pandas series with the correct data, but what I want to do is:
df['pp'] = df['cfs'].groupby(df['Alternatives']).apply(scipy.stats.mstats.plotting_positions,0,0)
However, this just returns a column of NaN.
Thanks
Upvotes: 2
Views: 202
Reputation: 17339
def func(x):
    x['pp'] = 1 - scipy.stats.mstats.plotting_positions(x.cfs, 0, 0)
    return x
# group_keys=False keeps the original row index on pandas >= 2.0
df.groupby('Alternatives', group_keys=False).apply(func)
Alternatives cfs pp
0 A 147248 0.333333
1 B 94894 0.500000
2 C 81792 0.333333
3 A 176011 0.166667
4 B 208514 0.166667
5 C 18111 0.833333
6 A 56742 0.666667
7 B 154900 0.333333
8 C 32778 0.666667
9 A 142333 0.500000
10 B 45267 0.666667
11 C 145211 0.166667
12 A 3429 0.833333
13 B 1258 0.833333
14 C 65439 0.500000
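As a possible alternative, the same column can probably be produced with transform, which returns values aligned to the original row index. The NaN column in the question most likely comes from index alignment: apply yields one array per group, labelled A, B, C, and those labels do not match the 0..14 row index on assignment. The sketch below rebuilds the question's df; the helper name exceedance is made up for illustration, and converting with to_numpy and np.asarray is an assumption made to keep the input and output as plain arrays.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

df = pd.DataFrame({
    'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900,
            32778, 142333, 45267, 145211, 3429, 1258, 65439],
    'Alternatives': ['A', 'B', 'C'] * 5,
})

def exceedance(s):
    # Hypothetical helper: plotting_positions returns a masked array,
    # so convert it to a plain ndarray of the same length as the group.
    pp = mstats.plotting_positions(s.to_numpy(), 0, 0)
    return 1 - np.asarray(pp)

# transform keeps the original row index, so the assignment lines up row by row
df['pp'] = df.groupby('Alternatives')['cfs'].transform(exceedance)
print(df)
```

Because transform hands each group to the function as a Series and expects a result of the same length, no manual re-indexing should be needed here.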
What helps to debug groupby is using get_group:
g = df.groupby('Alternatives').get_group('A')
g.apply(whatever) # test on a single group and then apply to all at once
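A minimal sketch of that debugging step, assuming the question's df and the func defined above (repeated here so the snippet runs on its own). Calling func directly on one group, and taking a copy first so the original frame is not modified, are choices made for this illustration.

```python
import pandas as pd
from scipy.stats import mstats

df = pd.DataFrame({
    'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900,
            32778, 142333, 45267, 145211, 3429, 1258, 65439],
    'Alternatives': ['A', 'B', 'C'] * 5,
})

def func(x):
    # Same logic as the answer's func, with the column passed as a plain array
    x['pp'] = 1 - mstats.plotting_positions(x['cfs'].to_numpy(), 0, 0)
    return x

# Pull out one group and work on a copy so df itself stays untouched
g = df.groupby('Alternatives').get_group('A').copy()
result = func(g)
print(result)
```

Once the function behaves as expected on a single group, it can be passed to the full groupby.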
Upvotes: 2