Reputation: 431
I have three columns in a DataFrame. I want to take the number in the Streak_Count column and sum up that number of cells from the returns in the MON TOTAL. The result is displayed in the WANTED RESULT as shown below. The issue I cant figure out is summing the number of cells which can be any number>> in this example between 1 and 4.
MON TOTAL STREAK_COUNT WANTED RESULT
1/2/1992 1.123077 1 1.123077 (only 1 so 1.12)
2/3/1992 -1.296718 0
3/2/1992 -6.355612 2 -7.65233 (sum of -1.29 and -6.35)
4/1/1992 5.634692 0
5/1/1992 4.180605 2 9.815297 (sum of 5.63 and 4.18)
7/1/1992 -0.101016 0
8/3/1992 -0.706125 2 -0.807141 (sum of -.10 and -.706)
10/1/1992 0.368579 0
11/2/1992 3.822277 0
1/4/1993 2.233359 0
2/1/1993 15.219644 4 21.643859
3/1/1993 -2.647693 1 -2.647693
4/1/1993 1.599094 1 1.599094
Upvotes: 1
Views: 3551
Reputation: 11453
EDIT 1:
Efficient version that performs operations on entire column rather than iterating through and utilizes object/iterator thus avoiding lists.
The changes here are that,
1) We first get Series for "STREAK_COUNT" and convert to tuple
with index as first value and "STREAK_COUNT" as second using enumerate
. Since enumerate
is an iterable, we can use it directly in step(3) instead of converting to list.
2) Define getTotal()
function and process to each value in the (1) independently to generate corresponding "RESULT" value.
3) map getTotal()
to entire "STREAK_COUNT" enumerate object generated in step(1) to get desired "RESULT" column.
Working Code
def getTotal(x):
pos = [i for i in range(x[0],x[0]-x[1],-1)]
total = sum(df.iloc[pos, 0])
func = lambda x : x if x !=0 else ''
return func(total)
df[ "RESULT" ] = map(lambda x: getTotal(x), enumerate(df["STREAK_COUNT"]))
print df
Older Version
Not as efficient since we need to iterate through rows and columns, but works for any values within STREAK_COUNT and more clear to understand.
Sample Code
import pandas as pd
df = pd.read_csv("sample.csv", index_col = 0)
#iterate over rows with index
for idx, (lab, row) in enumerate(df.iterrows()):
#get current STREAK_COUNT value
count = int(df.iloc[idx, 1])
#get previous positions based on count
pos = [i for i in range(idx,idx-count,-1)]
#count total
total = sum(df.iloc[pos, 0])
#create new column value based on total
func = lambda x : x if x !=0 else ''
df.loc[lab, "RESULT"] = func(total)
print df
RESULT
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
MON TOTAL STREAK_COUNT RESULT
1/2/1992 1.123077 1 1.12308
2/3/1992 -1.296718 0
3/2/1992 -6.355612 2 -7.65233
4/1/1992 5.634692 0
5/1/1992 4.180605 2 9.8153
7/1/1992 -0.101016 0
8/3/1992 -0.706125 2 -0.807141
10/1/1992 0.368579 0
11/2/1992 3.822277 0
1/4/1993 2.233359 0
2/1/1993 15.219644 4 21.6439
3/1/1993 -2.647693 1 -2.64769
4/1/1993 1.599094 1 1.59909
>>>
Upvotes: 0
Reputation: 86328
It's all about finding the right thing to group by. In this case, a reversed cumulative sum of STREAK_COUNT
will give you what you want.
First we create the dataframe:
import pandas as pd
>>> df = pd.DataFrame({'MON TOTAL':[1.123077, -1.296178, -6.355612, 5.634692, 4.180605, -0.101016, -0.706125,
0.368579, 3.822277, 2.233359, 15.219644, -2.647693, 1.599094],
'STREAK_COUNT':[1, 0, 2, 0, 2, 0, 2, 0, 0, 0, 4, 1, 1]},
index=['1/2/1992', '2/3/1992', '3/2/1992', '4/1/1992', '5/1/1992', '7/1/1992', '8/3/1992',
'10/1/1992', '11/2/1992', '1/4/1993', '2/1/1993', '3/1/1993', '4/1/1993'])
>>> df
MON TOTAL STREAK_COUNT
1/2/1992 1.123077 1
2/3/1992 -1.296178 0
3/2/1992 -6.355612 2
4/1/1992 5.634692 0
5/1/1992 4.180605 2
7/1/1992 -0.101016 0
8/3/1992 -0.706125 2
10/1/1992 0.368579 0
11/2/1992 3.822277 0
1/4/1993 2.233359 0
2/1/1993 15.219644 4
3/1/1993 -2.647693 1
4/1/1993 1.599094 1
Next find the groups, compute the sum of each group, and join the results to the original dataframe:
>>> groups = df['STREAK_COUNT'][::-1].cumsum()[::-1]
>>> df['RESULT'] = df.groupby(groups)['MON TOTAL'].transform('sum')
>>> df
MON TOTAL STREAK_COUNT RESULT
1/2/1992 1.123077 1 1.123077
2/3/1992 -1.296178 0 -7.651790
3/2/1992 -6.355612 2 -7.651790
4/1/1992 5.634692 0 9.815297
5/1/1992 4.180605 2 9.815297
7/1/1992 -0.101016 0 -0.807141
8/3/1992 -0.706125 2 -0.807141
10/1/1992 0.368579 0 21.643859
11/2/1992 3.822277 0 21.643859
1/4/1993 2.233359 0 21.643859
2/1/1993 15.219644 4 21.643859
3/1/1993 -2.647693 1 -2.647693
4/1/1993 1.599094 1 1.599094
If you just want results for the end of each streak, then use a mask to filter it:
>>> df[df['STREAK_COUNT'] > 0]
MON TOTAL STREAK_COUNT RESULT
1/2/1992 1.123077 1 1.123077
3/2/1992 -6.355612 2 -7.651790
5/1/1992 4.180605 2 9.815297
8/3/1992 -0.706125 2 -0.807141
2/1/1993 15.219644 4 21.643859
3/1/1993 -2.647693 1 -2.647693
4/1/1993 1.599094 1 1.599094
Upvotes: 1