Jeff Tilton

Reputation: 1296

Pandas perform operation over grouped data

I want to apply a function to grouped data in a pandas DataFrame. With the df below, the loop that follows gives the result I need, but I think this should be possible with pandas groupby.

import pandas as pd
import scipy
from scipy.stats import mstats 

df = pd.DataFrame({'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900, 32778, 142333, 45267, 145211, 3429, 1258, 65439], 'Alternatives':['A','B','C']*5})

alternatives = list(set(df['Alternatives']))

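frames = []

for alternative in alternatives:
    # copy the rows for this alternative so the new columns are not added to a view of df
    alt = df[df.Alternatives == alternative].copy()
    alt = alt.sort_values(['cfs'])
    alt['rank'] = alt['cfs'].rank()
    alt['pp'] = 1 - scipy.stats.mstats.plotting_positions(alt['cfs'], 0, 0)
    frames.append(alt)

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
df2 = pd.concat(frames)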

Output:

   Alternatives     cfs  rank        pp
12            A    3429   1.0  0.833333
6             A   56742   2.0  0.666667
9             A  142333   3.0  0.500000
0             A  147248   4.0  0.333333
3             A  176011   5.0  0.166667
5             C   18111   1.0  0.833333
8             C   32778   2.0  0.666667
14            C   65439   3.0  0.500000
2             C   81792   4.0  0.333333
11            C  145211   5.0  0.166667
13            B    1258   1.0  0.833333
10            B   45267   2.0  0.666667
1             B   94894   3.0  0.500000
7             B  154900   4.0  0.333333
4             B  208514   5.0  0.166667

I can get the rank with:

df['rank'] = df['cfs'].groupby(df['Alternatives']).rank()

But I cannot get the plotting positions. The closest I have is:

group = df['cfs'].groupby(df['Alternatives']).apply(scipy.stats.mstats.plotting_positions,0,0 ) 

This gives me a pandas series with the correct data, but what I want to do is:

df['pp'] = df['cfs'].groupby(df['Alternatives']).apply(scipy.stats.mstats.plotting_positions,0,0)  

However, this just produces a column of NaN.

Thanks

Upvotes: 2

Views: 202

Answers (1)

Dennis Golomazov

Reputation: 17339

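def func(x):
    # x is the sub-DataFrame for one group; add the pp column to it
    x['pp'] = 1 - scipy.stats.mstats.plotting_positions(x.cfs, 0, 0)
    return x

# group_keys=False keeps the original row index (pandas >= 2.0 otherwise prepends the group key)
df.groupby('Alternatives', group_keys=False).apply(func)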

   Alternatives     cfs        pp
0             A  147248  0.333333
1             B   94894  0.500000
2             C   81792  0.333333
3             A  176011  0.166667
4             B  208514  0.166667
5             C   18111  0.833333
6             A   56742  0.666667
7             B  154900  0.333333
8             C   32778  0.666667
9             A  142333  0.500000
10            B   45267  0.666667
11            C  145211  0.166667
12            A    3429  0.833333
13            B    1258  0.833333
14            C   65439  0.500000
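The NaN column in the question most likely comes from index alignment: apply on the grouped Series returns one entry per group, indexed by 'A', 'B', 'C', and that index has nothing in common with the frame's 0..14 row index. If only the new column is needed, transform may be a closer fit, because it returns a result aligned to the original rows. This is a minimal sketch, not a definitive implementation; the helper name pp_complement is mine, and passing a plain NumPy array to plotting_positions is a precaution I added rather than something the question requires.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

df = pd.DataFrame({
    'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900,
            32778, 142333, 45267, 145211, 3429, 1258, 65439],
    'Alternatives': ['A', 'B', 'C'] * 5,
})


def pp_complement(cfs):
    """Return 1 - plotting position for one group's values (hypothetical helper)."""
    positions = mstats.plotting_positions(cfs.to_numpy(), 0, 0)
    # plotting_positions returns a masked array; keep only the plain values
    return 1 - np.asarray(positions)


grouped = df.groupby('Alternatives')['cfs']
df['rank'] = grouped.rank()
# transform hands each group to the function and puts the results back in row order
df['pp'] = grouped.transform(pp_complement)
print(df)
```

Because transform keeps the original index, the assignment lines up row by row and no sorting or concatenation is needed afterwards.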

A useful way to debug a groupby is to pull out one group with get_group:

g = df.groupby('Alternatives').get_group('A')
func(g)  # test the function on a single group, then apply it to all groups at once
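As a concrete version of that debugging step, here is a short sketch that runs the function on group 'A' alone. It assumes the same df as in the question; the copy inside func and the np.asarray call are my additions, so that the test group is left untouched and a plain array is assigned.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

df = pd.DataFrame({
    'cfs': [147248, 94894, 81792, 176011, 208514, 18111, 56742, 154900,
            32778, 142333, 45267, 145211, 3429, 1258, 65439],
    'Alternatives': ['A', 'B', 'C'] * 5,
})


def func(x):
    x = x.copy()  # work on a copy so the group passed in is not modified
    positions = mstats.plotting_positions(x['cfs'].to_numpy(), 0, 0)
    x['pp'] = 1 - np.asarray(positions)
    return x


# Pull out a single group and call the function on it directly
g = df.groupby('Alternatives').get_group('A')
checked = func(g)
print(checked)
```

Once the result for one group looks right, the same function can be passed to groupby(...).apply.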

Upvotes: 2
