Reputation: 715
I have a Dataframe as below.
Name 2001 2002 2003 2004 2005 2006
Name1 2 5 0 0 4 6
Name2 1 4 2 0 4 0
Name3 0 5 0 0 0 2
I wanted to calculate the cumulative average for each row using pandas, But while calculating the Average It has to ignore if the value is zero.
The expected output is as below.
Name 2001 2002 2003 2004 2005 2006
Name1 2 3.5 3.5 3.5 3.75 4.875
Name2 1 2.5 2.25 2.25 3.125 3.125
Name3 0 5 5 5 5 3.5
Upvotes: 4
Views: 1980
Reputation: 1549
Given the following dataframe:
import pandas as pd
data = {
'2001': {'Name1': 2, 'Name2': 1, 'Name3': 0},
'2002': {'Name1': 5, 'Name2': 4, 'Name3': 5},
'2003': {'Name1': 0, 'Name2': 2, 'Name3': 0},
'2004': {'Name1': 0, 'Name2': 0, 'Name3': 0},
'2005': {'Name1': 4, 'Name2': 4, 'Name3': 0},
'2006': {'Name1': 6, 'Name2': 0, 'Name3': 2}
}
df = pd.DataFrame(data)
You basically need to perform a cumsum
(column-wise, so axis=1
) and then divide it by the cumsum
of all elements that are not 0
.
This can be done with:
df.cumsum(axis=1) / (df != 0).cumsum(axis=1)
The result will be:
2001 2002 2003 2004 2005 2006
Name1 2.0 3.5 3.500000 3.500000 3.666667 4.25
Name2 1.0 2.5 2.333333 2.333333 2.750000 2.75
Name3 NaN 5.0 5.000000 5.000000 5.000000 3.50
Upvotes: 0
Reputation: 175
def cumavg(s):
avg=[s[0]]
for i in range(1,len(s)):
if s[i]!=0:
if avg[i-1] ==0:
avg.append(s[i])
else:
avg.append((s[i]+avg[i-1])/2)
else:
avg.append(avg[-1])
return np.array(avg)
df.apply(lambda s:cumavg(s),axis='columns')
Upvotes: 0
Reputation: 939
First, it seems to me that there is a math problem in your calculations also according to your comments. If in the first row you are taking 3.5 (that is what you call current cumulative) PLUS 4 (the value at 2005 in the first row) and gives --> (3.5+4)/2= 3.75, then there is an error later in row two column 2005. In fact, there you have (2.25 + 4)/2 = 3.125. What you wrote is 3.175.
Now I believe there are a much better ways to implement my solution, but I get what you need here.
def cumulative_av(x):
b=[]
b.append(x[0])
for i in range(1,len(x)):
if(x[i]!=0 and b[-1]!=0):
b.append((x[i]+b[-1])*0.5)
elif(x[i]!=0 and b[-1]==0):
b.append(x[i])
elif(x[i]==0 and b[-1]!=0):
b.append(b[-1])
elif(x[i]==0 and b[-1]==0):
b.append(0)
return(b)
apd2=pd.DataFrame(columns=["2001", "2002", "2003", "2004", "2005", "2006"])
for i in range(3):
apd2.loc[i]=cumulative_av(apd.loc[i,].to_list())
Where "apd" is your very initial pandas dataframe. cumulative_av is a function that generates what you have defined (it is a very strange function in my opinion).
Here the results of my code:
2001 2002 2003 2004 2005 2006
0 2.0 3.5 3.50 3.50 3.750 4.875
1 1.0 2.5 2.25 2.25 3.125 3.125
2 0.0 5.0 5.00 5.00 5.000 3.500
Upvotes: 3