Insendive
Insendive

Reputation: 25

Calulations per category with pandas

I have difficulties using pandas combined with a for-loop and if-statement correctly. I will try to illustrate what I am trying to achieve and what I am currently doing with an example.

Consider I have six persons (the index represents the person) who have followed some courses A, B, C. All of them have obtained a grade as indicated below:

Index    Grade     Course
0        2         A
1        4.5       B
2        6         B
3        6.5       C
4        7.5       A
5        9         A

Now suppose I want to obtain the cumulative average grade (in my real problem I want to obtain the emperical cumulative distribution function, but I hope more people can help me by keeping it simple an focusing on the real issue). Denote the grade for person 0 as g_0, for person 1 as g_1, etc. Then the cumulative average grade for person 0 could be calculated as (note that I have sorted by grades on purpose): g_0 / 6. For person 1 as: (g_0 + g_1) / 6. And so on.

Now what I really want to do is to do this per category. So for every category I want to calculate this cumulative average separately, but eventually in the same column. In that way it would function such that the grades of the people could be compared in another way.

Let us denote the cumulative average grade by Cag, then we would like to have:

Index    Grade     Course     Cag
0        2         A          0.66      (2/3)  
1        4.5       B          2.25      (4.5/2)
2        6         B          5.25      ((4.5 + 6)/2)
3        6.5       C          6.5       (6.5/1)
4        7.5       A          3.17      ((2 + 7.5) / 3)
5        9         A          6.17      ((2 + 7.5 + 9) / 3)

Eventually I want to do this in a for-loop, as I have a lot of courses instead of 3. Now what I would do (but does not work) is the following:

 df = pd.DataFrame({ 'grd' : (2, 4.5, 6, 6.5, 7.5, 9),
                'cou' : pd.Categorical(["A", "B", "B", "C", "A", "A"])})
 iLen = len(df.grd)
 df['cgr'] = pd.Series(np.zeros(iLen), index = df.index)

 lCou = set(df.cou)


 for i in lCou:
     z = 0 
     n = sum(df.cou == i)
     for row in df['cou']:
         if i == row:
             z = (z + df['grd']) / n
             df['cgr'] = z

Unfortunately, it does not do what I want it to do. Thanks in advance for trying to help me out!

Upvotes: 1

Views: 40

Answers (1)

zipa
zipa

Reputation: 27869

This should do it:

df = pd.DataFrame({'Grade': [2.0, 4.5, 6.0, 6.5, 7.5, 9.0],
                   'Index': [0, 1, 2, 3, 4, 5],
                   'Course': ['A', 'B', 'B', 'C', 'A', 'A']})

df['Cag'] = (df.groupby('Course')['Grade'].transform('cumsum') / df.groupby('Course')['Index'].transform('count')).round(2)

df
#  Course  Grade  Index   Cag
#0      A    2.0      0  0.67
#1      B    4.5      1  2.25
#2      B    6.0      2  5.25
#3      C    6.5      3  6.50
#4      A    7.5      4  3.17
#5      A    9.0      5  6.17

Upvotes: 2

Related Questions