Reputation: 312
I have a dataframe sorted by date:
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                            '2016-04-30', '2016-05-31', '2016-06-30'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values('date')
print(df)
         date  idx  pct_val  val
3  2016-04-30    2      NaN   10
0  2016-04-30    1      NaN   10
4  2016-05-31    2      -10    0
1  2016-05-31    1      -10    0
5  2016-06-30    2      -10    0
2  2016-06-30    1      NaN    5
And I want to group by idx, then apply a cumulative function with some simple logic: if pct_val is null, add val to the running total; otherwise, multiply the running total by 1 + pct_val/100. Below, 'cumsum' shows the result of df.groupby('idx').val.cumsum(), and 'cumulative_func' is the result I want.
         date  idx  pct_val  val  cumsum  cumulative_func
3  2016-04-30    2      NaN   10      10               10
0  2016-04-30    1      NaN   10      10               10
4  2016-05-31    2      -10    0      10                9
1  2016-05-31    1      -10    0      10                9
5  2016-06-30    2      -10    0      10              8.1
2  2016-06-30    1      NaN    5      15               14
Any idea if there is a way to apply a custom cumulative function to a dataframe, or a better way to achieve this?
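To be concrete, the update rule I'm describing can be written as a plain Python fold over one group, outside pandas (just an illustration; the step function name is mine):

```python
import math
from itertools import accumulate

def step(total, row):
    """One update: add val when pct_val is missing, otherwise scale the total."""
    pct_val, val = row
    if pct_val is None or (isinstance(pct_val, float) and math.isnan(pct_val)):
        return total + val
    return total * (1 + pct_val / 100.)

# (pct_val, val) pairs for one idx group, in date order
rows = [(None, 10), (-10, 0), (-10, 0)]
totals = list(accumulate(rows, step, initial=0))[1:]  # drop the seed value
print(totals)
```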
Upvotes: 10
Views: 5179
Reputation: 294348
First, I cleaned up your setup:
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                            '2016-04-30', '2016-05-31', '2016-06-30'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])
print(df)
Looks like:
         date  idx  pct_val  val
0  2016-04-30    1      NaN   10
3  2016-04-30    2      NaN   10
1  2016-05-31    1    -10.0    0
4  2016-05-31    2    -10.0    0
2  2016-06-30    1      NaN    5
5  2016-06-30    2    -10.0    0
def cumcustom(df):
    df = df.copy()
    running_total = 0
    for idx, row in df.iterrows():
        if pd.isnull(row['pct_val']):
            # missing pct_val: add val to the running total
            running_total += row['val']
        else:
            # otherwise scale the running total by 1 + pct_val/100
            running_total *= row['pct_val'] / 100. + 1
        df.loc[idx, 'cumcustom'] = running_total
    return df
Then apply:
df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])
Looks like:
         date  idx  pct_val  val  cumcustom
0  2016-04-30    1      NaN   10       10.0
3  2016-04-30    2      NaN   10       10.0
1  2016-05-31    1    -10.0    0        9.0
4  2016-05-31    2    -10.0    0        9.0
2  2016-06-30    1      NaN    5       14.0
5  2016-06-30    2    -10.0    0        8.1
Upvotes: 1
Reputation: 109546
I don't believe there is an easy way to accomplish your objective using vectorization. I would first try to get something working, and then optimize for speed if required.
import numpy as np
import pandas as pd

def cumulative_func(df):
    results = []
    for group in df.groupby('idx').groups.values():
        total = 0
        result = []
        for p, v in df.loc[group, ['pct_val', 'val']].values:
            if np.isnan(p):        # missing pct_val: add val
                total += v
            else:                  # otherwise scale the running total
                total *= (1 + .01 * p)
            result.append(total)
        results.append(pd.Series(result, index=group))
    # stitch the per-group results back into the original row order
    return pd.concat(results).reindex(df.index)

df['cumulative_func'] = cumulative_func(df)
>>> df
         date  idx  pct_val  val  cumulative_func
3  2016-04-30    2      NaN   10             10.0
0  2016-04-30    1      NaN   10             10.0
4  2016-05-31    2      -10    0              9.0
1  2016-05-31    1      -10    0              9.0
5  2016-06-30    2      -10    0              8.1
2  2016-06-30    1      NaN    5             14.0
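That said, if pct_val is never exactly -100 (so the multiplier 1 + .01*p is never zero), the update is a linear recurrence t -> a*t + b, which does admit a vectorized closed form per group via cumulative products and sums. A sketch under that assumption (the function name is my own):

```python
import numpy as np

def cumulative_linear(pct_val, val):
    """Closed form of the recurrence t -> a*t + b for one group.

    Only valid when every multiplier a is nonzero (pct_val != -100).
    """
    p = np.asarray(pct_val, dtype=float)
    v = np.asarray(val, dtype=float)
    a = np.where(np.isnan(p), 1.0, 1 + 0.01 * p)  # multiplicative part
    b = np.where(np.isnan(p), v, 0.0)             # additive part
    A = np.cumprod(a)                             # running product of multipliers
    return A * np.cumsum(b / A)

# one idx group, in date order
print(cumulative_linear([np.nan, -10, np.nan], [10, 0, 5]))
```

This trades the Python-level loop for three array passes, at the cost of possible loss of precision when the running product A gets very small.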
Upvotes: 4