Anji
Anji

Reputation: 715

Calculate Cumulative Average using Pandas

I have a Dataframe as below.

Name  2001 2002 2003 2004 2005 2006  
Name1  2    5     0    0    4    6  
Name2  1    4     2    0    4    0  
Name3  0    5     0    0    0    2  

I wanted to calculate the cumulative average for each row using pandas, But while calculating the Average It has to ignore if the value is zero.

The expected output is as below.

Name  2001  2002  2003  2004  2005  2006  
Name1  2    3.5    3.5  3.5   3.75  4.875  
Name2  1    2.5   2.25  2.25  3.125 3.125  
Name3  0     5     5     5    5     3.5  

Upvotes: 4

Views: 1980

Answers (3)

villoro
villoro

Reputation: 1549

Given the following dataframe:

import pandas as pd
data = {
    '2001': {'Name1': 2, 'Name2': 1, 'Name3': 0},
    '2002': {'Name1': 5, 'Name2': 4, 'Name3': 5},
    '2003': {'Name1': 0, 'Name2': 2, 'Name3': 0},
    '2004': {'Name1': 0, 'Name2': 0, 'Name3': 0},
    '2005': {'Name1': 4, 'Name2': 4, 'Name3': 0},
    '2006': {'Name1': 6, 'Name2': 0, 'Name3': 2}
}
df = pd.DataFrame(data)

You basically need to perform a cumsum (column-wise, so axis=1) and then divide it by the cumsum of all elements that are not 0.

This can be done with:

df.cumsum(axis=1) / (df != 0).cumsum(axis=1)

The result will be:

       2001  2002      2003      2004      2005  2006
Name1   2.0   3.5  3.500000  3.500000  3.666667  4.25
Name2   1.0   2.5  2.333333  2.333333  2.750000  2.75
Name3   NaN   5.0  5.000000  5.000000  5.000000  3.50

Upvotes: 0

Mohammed Khalid
Mohammed Khalid

Reputation: 175

def cumavg(s):
    avg=[s[0]]
    for i in range(1,len(s)):
        if s[i]!=0:
            if avg[i-1] ==0:
                avg.append(s[i])
            else:    
                avg.append((s[i]+avg[i-1])/2)
        else:
            avg.append(avg[-1])
    return np.array(avg)
df.apply(lambda s:cumavg(s),axis='columns')

Upvotes: 0

Fabrizio
Fabrizio

Reputation: 939

First, it seems to me that there is a math problem in your calculations also according to your comments. If in the first row you are taking 3.5 (that is what you call current cumulative) PLUS 4 (the value at 2005 in the first row) and gives --> (3.5+4)/2= 3.75, then there is an error later in row two column 2005. In fact, there you have (2.25 + 4)/2 = 3.125. What you wrote is 3.175.

Now I believe there are a much better ways to implement my solution, but I get what you need here.

def cumulative_av(x):
    b=[]
    b.append(x[0])
    for i in range(1,len(x)):
        if(x[i]!=0 and b[-1]!=0):
            b.append((x[i]+b[-1])*0.5)
        elif(x[i]!=0 and b[-1]==0):
            b.append(x[i])
        elif(x[i]==0 and b[-1]!=0):
            b.append(b[-1])
        elif(x[i]==0 and b[-1]==0):
            b.append(0)

    return(b)


apd2=pd.DataFrame(columns=["2001", "2002", "2003", "2004", "2005", "2006"])
for i in range(3):
    apd2.loc[i]=cumulative_av(apd.loc[i,].to_list())

Where "apd" is your very initial pandas dataframe. cumulative_av is a function that generates what you have defined (it is a very strange function in my opinion).

Here the results of my code:

    2001    2002    2003    2004    2005    2006
0   2.0     3.5     3.50    3.50    3.750   4.875
1   1.0     2.5     2.25    2.25    3.125   3.125
2   0.0     5.0     5.00    5.00    5.000   3.500

Upvotes: 3

Related Questions