Better way to recalculate pandas dataframe fields depending on other fields

Question

I am new to python pandas. And i wanted to estimate a value of inflow payment/s over term considering fees and growth over a period. I only used one payment (inflow) to test. Sometimes fee2 can be applied up to period n-t. i.e. Not for the whole period, n.

I did it like below, just wondering if there is a better way to recalculate the values without looping?

Example in spreadsheet:

Python code:

import pandas as pd
import numpy as np

def getCashFlows():
   term = 2
   growthRate = (1+0.06)**(1/12) - 1
   df = pd.DataFrame(list(range(1,term*12+1)), columns=['t'])
   df['Value_t_1'] = 0
   df['Inflow1']=0
   df['growth']=0
   df['ValuePlusGrowth'] = 0
   df['fee1']=0
   df['fee2']=30
   df['Value_t']=0

   df.set_value(0, 'Inflow1', 10000)

   for i in range(0,term*12):
      df['Value_t_1'] = df['Value_t'].shift()
      df['Value_t_1'].fillna(0,inplace=True)

      df['growth'] = (df['Value_t_1'] + df['Inflow1'])*growthRate
      df['ValuePlusGrowth'] = df['Value_t_1']+df['Inflow1']+df['growth']
      df['fee1']=df['ValuePlusGrowth']*0.5/100
      df['Value_t'] = df['ValuePlusGrowth'] - df['fee1'] - df['fee2']
   return df

andrew_reece · Accepted Answer

The only initial input that's really needed is the initial value for inflow. Everything else can be reduced to operations that are repeated for a set number of times, based on the row index. And some of the columns in your data frame are really just constants.

Here's a solution that clarifies the operations needed to compute each row of the data frame:

import pandas as pd

class GrowthTracker(object):

    def __init__(self, n_iter):

        self.colnames = ['Value_t_1', 'growth', 'ValuePlusGrowth', 'fee1', 'Value_t']
        self.data = None
        self.fee1_mult = 0.5/100
        self.fee2 = (0,0,0,0,30)
        self.growthRate = (1+0.06)**(1/12) - 1
        self.n_iter = n_iter
        self.ops = pd.Series([1, # Value_t_1
                              self.growthRate, # growth
                              (1 + self.growthRate), # ValuePlusGrowth
                              (1 + self.growthRate) * self.fee1_mult, # fee1
                              (1 + self.growthRate) * (1 - self.fee1_mult) # Value_t
                             ])

    def update(self, t, n, df=None):
        row = self.ops.mul(t).subtract(self.fee2)
        tmp = pd.concat([df, row], axis = 1, ignore_index=True)
        if n < self.n_iter: 
            self.data = self.update(row.iloc[-1], n+1, tmp)
            return self.data
        else:
            tmp.iloc[0,0] = 0 # remove the initial 10000 from Value_t_1
            self.data = tmp.T
            self.data.columns = self.colnames
            return self.data

Now just set the initial values, instantiate the GrowthTracker object, and update():

total_iter = 23

tracker = GrowthTracker(n_iter=total_iter)

inflow = 10000
start_index = 0

tracker.update(t=inflow, n=start_index)

tracker.data

      Value_t_1     growth  ValuePlusGrowth       fee1      Value_t
0      0.000000  48.675506     10048.675506  50.243378  9968.432128
1   9968.432128  48.521847     10016.953976  50.084770  9936.869206
2   9936.869206  48.368213      9985.237419  49.926187  9905.311232
3   9905.311232  48.214603      9953.525835  49.767629  9873.758206
4   9873.758206  48.061017      9921.819223  49.609096  9842.210127
5   9842.210127  47.907455      9890.117583  49.450588  9810.666995
6   9810.666995  47.753918      9858.420912  49.292105  9779.128808
7   9779.128808  47.600404      9826.729212  49.133646  9747.595566
8   9747.595566  47.446914      9795.042480  48.975212  9716.067268
9   9716.067268  47.293449      9763.360716  48.816804  9684.543913
10  9684.543913  47.140007      9731.683920  48.658420  9653.025500
11  9653.025500  46.986590      9700.012090  48.500060  9621.512030
12  9621.512030  46.833196      9668.345226  48.341726  9590.003500
13  9590.003500  46.679827      9636.683327  48.183417  9558.499910
14  9558.499910  46.526482      9605.026392  48.025132  9527.001260
15  9527.001260  46.373160      9573.374420  47.866872  9495.507548
16  9495.507548  46.219863      9541.727411  47.708637  9464.018774
17  9464.018774  46.066590      9510.085364  47.550427  9432.534937
18  9432.534937  45.913341      9478.448278  47.392241  9401.056037
19  9401.056037  45.760116      9446.816152  47.234081  9369.582072
20  9369.582072  45.606915      9415.188986  47.075945  9338.113041
21  9338.113041  45.453737      9383.566779  46.917834  9306.648945
22  9306.648945  45.300584      9351.949529  46.759748  9275.189781
23  9275.189781  45.147455      9320.337237  46.601686  9243.735551

I found it easier to express this all as a class, but it's simple enough to just define variables outside of the class and then just run the update() function.

UPDATE
Here's a bit more explanation behind this solution:

The initial data frame df is mostly empty. The only fully non-zero columns are t, which is never used, and fee2, which is a constant (fee2 = 30). The entire remainder of df starts out as zero-valued, with the one exception of the first cell in Inflow1 - its first value is 10000, and the rest of its values are zeros.

That means that, in terms of what we need to accomplish computationally, we can limit our "matrix of interest" to the columns Value_t_1, growth, ValuePlusGrowth, fee1, and Value_t.

We can think of that first Inflow1 value as the seed - everything else is just a string of operations performed on the number 10000. (In fact, we don't actually need Inflow1 as a field, as all of its other values remain zero throughout the calculations.)

In your loop, you initially updated columns using the values of other columns. That makes sense, it's probably how I would have done it too - looks neat and efficient. Recall, however, that every update is really just a string of math tracing its lineage back to that original 10000. Writing out the actual operations for each column update, instead of using other column names, shows how each update operation can be simplified.

First, a few shorthand notations:

t = Value_t from previous row (in case of the first row, Value_t = Inflow1 = 10000)
t1 = Value_t_1 
g = growth
inf = Inflow1 
vpg = ValuePlusGrowth
gr = growthRate # gr is a constant: (1+0.06)**(1/12) - 1
f1X = 0.5/100
new_t = Value_t for current row

We start with t = 10000. Everything else is some operation on t.

Every value can be expressed in terms of what we need to multiply t by in order to get the desired value (with one exception that I'll get to later). So for example:

df['Value_t_1'] = df['Value_t'].shift()
df['Value_t_1'].fillna(0,inplace=True)

# equivalent to:
t1 = 1 * t # recall t is the shifted Value_t from the previous row

Keep in mind that we only need to drop in the seed value t once, and then it's just operations on the seed to populate all of df. That means that the operations in your loop can be expressed as "the terms that need to be multiplied by t in order to get the correct column value". So although we have shown that t1 = 1 * t, it's more useful for us to think of t1 = 1 - eventually we'll multiply that by t, but the right side of that equation represents t1's relationship to t.

Then:

t1 = 1

# Inflow1 is always 0, except for its initial value which we capture in initial t, so:
df['growth'] = (df['Value_t_1'] + df['Inflow1'])*growthRate
# becomes:
g = t1 * gr 
# with t1 = 1
g = gr

# we know t1 = 1, and inf is never used as a multiplier, so:
df['ValuePlusGrowth'] = df['Value_t_1']+df['Inflow1']+df['growth']
# becomes:
vpg = 1 + g = 1 + gr

df['fee1']=df['ValuePlusGrowth']*0.5/100
# becomes:
fee1 = vpg * f1X = (1 + gr) * f1X

# we'll ignore subtracting fee2 for now, see notes below.
df['Value_t'] = df['ValuePlusGrowth'] - df['fee1'] - df['fee2']
# becomes:
new_t = vpg - fee1 = (1 + gr) - ((1 + gr) * f1X) = (1 + gr) * (1 - f1X)

ops = (t1, g, vpg, fee1, new_t)

Now, for each row, we have a set of update operations ops for each column. Given that we have t from the previous row, we can populate values for each row with:

new_row = t * ops

We still need to subtract fee2 from new_t, and that doesn't fit neatly into what, up until this point, had been a series of multiplicative operations. But we can stick with our vectorized formulation and define:

fee2 = (0,0,0,0,30)

After each new_row we subtract the fee2 vector from the new_row vector, which really just subtracts fee2 from new_t as desired.

new_row = t * ops - fee2

At this point, we just need a function that starts with t = 10000 and keeps executing the new_row formula, building on each previous row, until we've reached our desired number of iterations. I chose a recursive strategy to do this, and saved each new_row to a data frame at every recursive step.

Finally, since I somewhat abused your original notation by setting t = 10000 instead of Inflow1 = 10000, this means the first t1 value is incorrectly set at 10000. At the end of the update() function we set this first t1 value back to 0.

Better way to recalculate pandas dataframe fields depending on other fields

Answers (1)

Related Questions