Problem with pandas.DataFrame.cumsum function

Question

I have the following dataframe in python:

import numpy as np
import pandas as pd

month = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4]
active = [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
data1 = [1709.1,3869.7,4230.4,4656.9,48566.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93738.2,189293.2,194412.6,206585.8]
df = pd.DataFrame({
    'month': month,
    'active': active,
    'd1': data1,
    'calculate': 0,  # placeholder, filled in below
})

I want to calculate the column 'calculate' as a running sum of d1/3 that restarts with each block of active rows and is 0 for inactive rows, like this:

    month  active        d1  calculate
0       1       1    1709.1     569.70
1       2       1    3869.7    1859.60
2       3       1    4230.4    3269.73
3       4       1    4656.9    4822.03
4       5       0   48566.0       0.00
5       6       0       0.0       0.00
6       7       0       0.0       0.00
7       8       0       0.0       0.00
8       9       0       0.0       0.00
9      10       0       0.0       0.00
10     11       0       0.0       0.00
11     12       0       0.0       0.00
12     13       0       0.0       0.00
13     14       0       0.0       0.00
14     15       0       0.0       0.00
15     16       0       0.0       0.00
16     17       0       0.0       0.00
17     18       0       0.0       0.00
18     19       0       0.0       0.00
19     20       0       0.0       0.00
20      1       1   93738.2   31246.07
21      2       1  189293.2   94343.80
22      3       1  194412.6  159148.00
23      4       1  206585.8  228009.93

I am doing it in the following way:

df['calculate'] = np.where(
                        df.month > 1,
                        np.where(
                                    df.active,
                                    (df.d1/3).cumsum(),
                                    0,
                        ),
                        (df['d1']/3)
                    )

but the result is not what I expected (see rows 21 to 23):

    month  active        d1      calculate
0       1       1    1709.1     569.700000
1       2       1    3869.7    1859.600000
2       3       1    4230.4    3269.733333
3       4       1    4656.9    4822.033333
4       5       0   48566.0       0.000000
5       6       0       0.0       0.000000
6       7       0       0.0       0.000000
7       8       0       0.0       0.000000
8       9       0       0.0       0.000000
9      10       0       0.0       0.000000
10     11       0       0.0       0.000000
11     12       0       0.0       0.000000
12     13       0       0.0       0.000000
13     14       0       0.0       0.000000
14     15       0       0.0       0.000000
15     16       0       0.0       0.000000
16     17       0       0.0       0.000000
17     18       0       0.0       0.000000
18     19       0       0.0       0.000000
19     20       0       0.0       0.000000
20      1       1   93738.2   31246.066667
21      2       1  189293.2  115354.500000
22      3       1  194412.6  180158.700000
23      4       1  206585.8  249020.633333

I hope my question is clear. Thanks to anyone who can help.

user3483203 · Accepted Answer

New Answer

Your conditions overcomplicate the problem, which can be reduced to this:

df.groupby(df.active.ne(df.active.shift()).cumsum()).d1.cumsum().div(3) * df.active

0        569.700000
1       1859.600000
2       3269.733333
3       4822.033333
4          0.000000
5          0.000000
6          0.000000
7          0.000000
8          0.000000
9          0.000000
10         0.000000
11         0.000000
12         0.000000
13         0.000000
14         0.000000
15         0.000000
16         0.000000
17         0.000000
18         0.000000
19         0.000000
20     31246.066667
21     94343.800000
22    159148.000000
23    228009.933333
dtype: float64
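
To make the one-liner easier to follow, here is a sketch that splits it into named steps on the question's data. The intermediate names (changed, run_id, run_total) are only illustrative and do not appear in the original answer.

```python
import pandas as pd

month = list(range(1, 21)) + [1, 2, 3, 4]
active = [1] * 4 + [0] * 16 + [1] * 4
data1 = [1709.1, 3869.7, 4230.4, 4656.9, 48566.0] + [0.0] * 15 \
        + [93738.2, 189293.2, 194412.6, 206585.8]
df = pd.DataFrame({"month": month, "active": active, "d1": data1})

# Step 1: True wherever `active` differs from the previous row.
# Row 0 is always True because shift() puts NaN in front of it.
changed = df["active"].ne(df["active"].shift())

# Step 2: a running count of those changes gives one label per
# consecutive run of equal values (1, 1, 1, 1, 2, 2, ..., 3, 3, 3, 3).
run_id = changed.cumsum()

# Step 3: cumulative sum of d1 that restarts at the start of every run.
run_total = df.groupby(run_id)["d1"].cumsum()

# Step 4: divide by 3 and zero out the inactive rows (active is 0 or 1).
calculate = run_total.div(3) * df["active"]
print(calculate)
```

Multiplying by df.active works as a mask here only because the column holds 0 and 1; with other values a boolean mask would be the safer choice.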

Old Answer (I think it's still useful to include to explain the logic of what was being attempted)

You only want the cumulative sum within each consecutive run of 1s. However, (df.d1/3).cumsum() is computed over the whole column, so by the time it reaches the second run of 1s it has already accumulated everything above it, including the 48566.0 in row 4. One solution is to calculate the cumsum per group of consecutive equal values of active, and then use this result in your final check.
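
As a rough illustration of the difference, the following sketch compares a whole-column cumsum with a per-run cumsum on made-up toy data (the flag and value series are hypothetical, not taken from the question):

```python
import pandas as pd

flag = pd.Series([1, 1, 0, 0, 1, 1])
value = pd.Series([10, 20, 5, 5, 30, 40])

# A plain cumsum never resets, so the second run of 1s
# carries over everything that came before it.
global_total = value.cumsum()

# Label each consecutive run of equal flags, then cumsum inside each run.
run_id = flag.ne(flag.shift()).cumsum()
per_run_total = value.groupby(run_id).cumsum()

print(global_total.tolist())   # [10, 30, 35, 40, 70, 110]
print(run_id.tolist())         # [1, 1, 2, 2, 3, 3]
print(per_run_total.tolist())  # [10, 30, 5, 10, 30, 70]
```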

I also like to avoid chaining calls to np.where when np.select is much easier to read.

s = df.groupby(df.active.ne(df.active.shift()).cumsum()).d1.cumsum()

is_active = df.active.astype(bool)  # ~ on an integer column is bitwise, so convert to bool first
c1 = df.month.gt(1) & is_active
c2 = df.month.gt(1) & ~is_active

df.assign(calculate=np.select([c1, c2], [s.div(3), 0], df.d1.div(3)))

    month  active        d1      calculate
0       1       1    1709.1     569.700000
1       2       1    3869.7    1859.600000
2       3       1    4230.4    3269.733333
3       4       1    4656.9    4822.033333
4       5       0   48566.0       0.000000
5       6       0       0.0       0.000000
6       7       0       0.0       0.000000
7       8       0       0.0       0.000000
8       9       0       0.0       0.000000
9      10       0       0.0       0.000000
10     11       0       0.0       0.000000
11     12       0       0.0       0.000000
12     13       0       0.0       0.000000
13     14       0       0.0       0.000000
14     15       0       0.0       0.000000
15     16       0       0.0       0.000000
16     17       0       0.0       0.000000
17     18       0       0.0       0.000000
18     19       0       0.0       0.000000
19     20       0       0.0       0.000000
20      1       1   93738.2   31246.066667
21      2       1  189293.2   94343.800000
22      3       1  194412.6  159148.000000
23      4       1  206585.8  228009.933333
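
For readers unfamiliar with np.select, this small sketch on a toy array (not the question's data) shows a nested np.where next to the equivalent np.select call. np.select takes a list of conditions and a list of choices, uses the first condition that matches for each element, and falls back to default otherwise.

```python
import numpy as np

x = np.array([-2, 0, 3])

# Nested np.where: each extra case adds another level of nesting.
nested = np.where(x > 0, 1, np.where(x < 0, -1, 0))

# np.select: conditions and choices are listed side by side.
flat = np.select([x > 0, x < 0], [1, -1], default=0)

print(nested.tolist())  # [-1, 0, 1]
print(flat.tolist())    # [-1, 0, 1]
```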
