Reputation: 384
I can't wrap my head around solving this problem without itertuples.
I want to take mean values over groups of rows whose cumulative sum of a different column adds up to less than one third of that column's total sum.
Starting dataframe:
import pandas as pd

df = pd.DataFrame({'model_1': [0.15, 0.19, 0.25, 0.54, 0.55, 0.98, 1.12],
                   'model_2': [0.12, 0.13, 0.32, 0.45, 0.6, 0.7, 1.05],
                   'exposure': [0.4, 1, 1.6, 1, 2, 2, 3],
                   'target': [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.1]})
Here the total exposure is 11, and my intention is to make 3 buckets: each bucket collects rows until their cumulative exposure reaches 1/3 of the total, and I then take the mean over each bucket.
The first 4 rows have a cumulative exposure of 4, and over those rows I want an exposure-weighted mean of each column.
This means the first value in aggr_model_1 is:
((0.15 * 0.4) + (0.19 * 1) + (0.25 * 1.6) + (0.54 * 1)) / 4 = 0.2975
where the denominator 4 is the bucket's total exposure. The same process is then applied to aggr_model_2 and aggr_target.
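A quick sanity check of that arithmetic (a minimal sketch using just the two columns involved):

```python
import pandas as pd

# First bucket only: the 4 rows whose cumulative exposure is 4
df = pd.DataFrame({'model_1': [0.15, 0.19, 0.25, 0.54],
                   'exposure': [0.4, 1, 1.6, 1]})

# Exposure-weighted sum of model_1, divided by the bucket's total exposure
aggr_model_1 = (df.model_1 * df.exposure).sum() / df.exposure.sum()
print(round(aggr_model_1, 4))  # 0.2975
```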
Output dataframe:
output_df = pd.DataFrame({'aggr_model_1': [0.2975, 0.765, 1.12],
                          'aggr_model_2': [0.285, 0.65, 1.05],
                          'aggr_exposure': [4, 4, 3],
                          'aggr_target': [0.28, 0.65, 1.1]})
Upvotes: 0
Views: 202
Reputation: 1842
I'll take a shot; let's see if I understood it right. The ingredients of this computation are:
import numpy as np
import pandas as pd

total = df.exposure.sum()                      # 11.0
bins = np.linspace(0, total, 4)                # 3 equal-width bins over [0, total]
cum_exposure = df.exposure.cumsum()
bin_cum_exposure = pd.cut(cum_exposure, bins)  # bin label for each row
w_model_1 = df.exposure * df.model_1           # exposure-weighted model_1

# Standalone, the grouping works directly on the Series:
w_model_1.groupby(bin_cum_exposure).mean()
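For reference, this is what the intermediate pieces work out to on your data; note that the first bin catches only 3 rows, because the fourth cumulative value (4.0) already exceeds 11/3 ≈ 3.67:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'exposure': [0.4, 1, 1.6, 1, 2, 2, 3]})

total = df.exposure.sum()            # 11.0
bins = np.linspace(0, total, 4)      # [0, 3.667, 7.333, 11]
cum_exposure = df.exposure.cumsum()  # 0.4, 1.4, 3.0, 4.0, 6.0, 8.0, 11.0
bin_cum_exposure = pd.cut(cum_exposure, bins)

# Rows per bin: 3, 2, 2
print(bin_cum_exposure.value_counts(sort=False))
```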
Putting things together:
total = df.exposure.sum()
bins = np.linspace(0, total, 4)

(df.assign(bin_cum_exposure=lambda x: pd.cut(x.exposure.cumsum(), bins),
           w_model_1=lambda x: x.exposure * x.model_1,
           w_model_2=lambda x: x.exposure * x.model_2,
           w_target=lambda x: x.exposure * x.target)
   .groupby('bin_cum_exposure')
   .mean()
)
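One caveat: .mean() above is a plain mean of the weighted columns, while your manual computation divides the weighted sum by the bucket's total exposure. If the exposure-weighted mean is what you're after, here is one sketch (still with the 3/2/2 binning, so the numbers won't match your 4-row first bucket):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'model_1': [0.15, 0.19, 0.25, 0.54, 0.55, 0.98, 1.12],
                   'model_2': [0.12, 0.13, 0.32, 0.45, 0.6, 0.7, 1.05],
                   'exposure': [0.4, 1, 1.6, 1, 2, 2, 3],
                   'target': [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.1]})

bins = np.linspace(0, df.exposure.sum(), 4)
key = pd.cut(df.exposure.cumsum(), bins)

# Weight every value column by exposure, sum per bin,
# then divide by each bin's total exposure.
weighted = df[['model_1', 'model_2', 'target']].mul(df.exposure, axis=0)
result = weighted.groupby(key).sum().div(df.exposure.groupby(key).sum(), axis=0)
```

With this the first bin gives 0.65 / 3 ≈ 0.2167 for model_1, i.e. the same kind of weighted mean as your 0.2975, just over 3 rows instead of 4.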
The answer differs from your manual computation because the first bin gets 3 rows rather than the 4 in your example: the first 4 rows' cumulative exposure is 4, which already exceeds 11/3 ≈ 3.67.
Upvotes: 1