Reputation: 85
I am trying to create a column that is a conditional cumulative sum in Pandas. I know how to do this easily in Excel but am really struggling to work out how I can achieve this in Pandas. Consider the following dataframe
df:
Year Income
0 2001 17
1 2001 4
1 2001 9
2 2002 6
2 2002 1
2 2002 1
3 2003 12
3 2003 11
3 2003 5
I would like to add a new column to the dataframe. The new column should contain the sum of all Incomes where:
So the output would look like:
df:
Year Income Cum_Income
0 2001 17 13 # 2 values less than 17 in 2001: 4 + 9 = 13
1 2001 4 0 # no values less than 4 in 2001: 0
1 2001 9 4 # 1 value less than 9 in 2001 : 4
2 2002 6 2 # etc.
2 2002 1 0
2 2002 1 0
3 2003 12 16
3 2003 11 5
3 2003 5 0
The only way I can think of to achieve this is to loop through each row and calculate each cell in Cum_income separately, but I'm hopeful that there is a cleaner way than that. Any suggestions would be very helpful.
Upvotes: 1
Views: 193
Reputation: 323226
Some magic from numpy
broadcast
s1 = df.Year.values
s2 = df.Income.values
np.sum((s1==s1[:,None])*((s2<s2[:,None])*s2), axis=1)
Out[246]: array([13, 0, 4, 2, 0, 0, 16, 5, 0], dtype=int64)
#df['New'] = np.sum((s1==s1[:,None])*((s2<s2[:,None])*s2), axis=1)
Explain
(s1==s1[:,None])
match the year , if the year not same we do not check the value , in your case it will return False , value * False will be 0
(s2<s2[:,None])*s2
check the value of current row whether have the value is greater than it , if return True , then True multiple the row value will be the value for future to add up.
Upvotes: 1