Reputation: 168
I have the following DataFrame in Python:
import numpy as np
import pandas as pd

month = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4]
active = [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
data1 = [1709.1,3869.7,4230.4,4656.9,48566.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93738.2,189293.2,194412.6,206585.8]
df = pd.DataFrame({
    'month': month,
    'active': active,
    'd1': data1,
    'calculate': 0,  # placeholder column, to be filled in
})
and I want to calculate the column 'calculate', in the following way:
    month  active        d1  calculate
0       1       1    1709.1     569.70
1       2       1    3869.7    1859.60
2       3       1    4230.4    3269.73
3       4       1    4656.9    4822.03
4       5       0   48566.0       0.00
5       6       0       0.0       0.00
6       7       0       0.0       0.00
7       8       0       0.0       0.00
8       9       0       0.0       0.00
9      10       0       0.0       0.00
10     11       0       0.0       0.00
11     12       0       0.0       0.00
12     13       0       0.0       0.00
13     14       0       0.0       0.00
14     15       0       0.0       0.00
15     16       0       0.0       0.00
16     17       0       0.0       0.00
17     18       0       0.0       0.00
18     19       0       0.0       0.00
19     20       0       0.0       0.00
20      1       1   93738.2   31246.07
21      2       1  189293.2   94343.80
22      3       1  194412.6  159148.00
23      4       1  206585.8  228009.93
I am doing it in the following way:
df['calculate'] = np.where(
    df.month > 1,
    np.where(
        df.active,
        (df.d1/3).cumsum(),
        0,
    ),
    (df['d1']/3)
)
but the result is not what was expected:
    month  active        d1    calculate
0       1       1    1709.1   569.700000
1       2       1    3869.7  1859.600000
2       3       1    4230.4  3269.733333
3       4       1    4656.9  4822.033333
4       5       0   48566.0     0.000000
5       6       0       0.0     0.000000
6       7       0       0.0     0.000000
7       8       0       0.0     0.000000
8       9       0       0.0     0.000000
9      10       0       0.0     0.000000
10     11       0       0.0     0.000000
11     12       0       0.0     0.000000
12     13       0       0.0     0.000000
13     14       0       0.0     0.000000
14     15       0       0.0     0.000000
15     16       0       0.0     0.000000
16     17       0       0.0     0.000000
17     18       0       0.0     0.000000
18     19       0       0.0     0.000000
19     20       0       0.0         0.00
20      1       1   93738.2     31246.07
21      2       1  189293.2    115354.50
22      3       1  194412.6    180158.70
23      4       1  206585.8    249020.63
I am not sure whether my request is clear. Thanks in advance to anyone who can help.
Upvotes: 1
Views: 633
Reputation: 51175
New Answer
Your conditions were overcomplicating the problem. It can be reduced to this:
df.groupby(df.active.ne(df.active.shift()).cumsum()).d1.cumsum().div(3) * df.active
0 569.700000
1 1859.600000
2 3269.733333
3 4822.033333
4 0.000000
5 0.000000
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 0.000000
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 31246.066667
21 94343.800000
22 159148.000000
23 228009.933333
dtype: float64
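To make the one-liner easier to follow, here is a minimal sketch that splits it into steps on a small toy frame. The frame `toy` and the names `changed` and `run_id` are made up for illustration and are not part of the question's data; the logic is meant to mirror the expression above.

```python
import pandas as pd

# Toy data (hypothetical, not the question's frame):
# two active runs separated by an inactive one.
toy = pd.DataFrame({
    "active": [1, 1, 0, 0, 1, 1],
    "d1": [3.0, 6.0, 9.0, 0.0, 30.0, 60.0],
})

# True wherever `active` differs from the previous row
# (the first row compares against NaN, so it counts as a change).
changed = toy.active.ne(toy.active.shift())

# Running count of changes: each consecutive run of equal values shares one id.
run_id = changed.cumsum()

# The cumulative sum restarts within each run; dividing by 3 and
# multiplying by `active` zeroes out the inactive rows.
result = toy.groupby(run_id).d1.cumsum().div(3) * toy.active

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3]
print(result.tolist())  # [1.0, 3.0, 0.0, 0.0, 10.0, 30.0]
```

Note that the second active run starts again from 10.0 rather than continuing from the earlier total, which is the reset behaviour the question asks for.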
Old Answer (I think it's still useful to include to explain the logic of what was being attempted)
You only want the cumulative sum within each consecutive run of 1s. However, a plain cumsum never resets, so by the time you reach the second run at the end of your DataFrame it is still carrying the total of all earlier rows. One solution is to calculate the cumsum per group of consecutive 1s and then use that result in your final check.
I also like to avoid chaining calls to np.where when np.select is much easier to read.
s = df.groupby(df.active.ne(df.active.shift()).cumsum()).d1.cumsum()
c1 = df.month.gt(1) & df.active
c2 = df.month.gt(1) & ~df.active
df.assign(calculate=np.select([c1, c2], [s.div(3), 0], df.d1.div(3)))
month active d1 calculate
0 1 1 1709.1 569.700000
1 2 1 3869.7 1859.600000
2 3 1 4230.4 3269.733333
3 4 1 4656.9 4822.033333
4 5 0 48566.0 0.000000
5 6 0 0.0 0.000000
6 7 0 0.0 0.000000
7 8 0 0.0 0.000000
8 9 0 0.0 0.000000
9 10 0 0.0 0.000000
10 11 0 0.0 0.000000
11 12 0 0.0 0.000000
12 13 0 0.0 0.000000
13 14 0 0.0 0.000000
14 15 0 0.0 0.000000
15 16 0 0.0 0.000000
16 17 0 0.0 0.000000
17 18 0 0.0 0.000000
18 19 0 0.0 0.000000
19 20 0 0.0 0.000000
20 1 1 93738.2 31246.066667
21 2 1 189293.2 94343.800000
22 3 1 194412.6 159148.000000
23 4 1 206585.8 228009.933333
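As a side note on the readability point, here is a small sketch comparing nested np.where with np.select on a toy array. The array `x` and the sign-style labelling are just an illustration, not something from the question.

```python
import numpy as np

x = np.array([-2, 0, 3])

# Nested np.where: the second condition is buried in the "else" branch.
nested = np.where(x > 0, 1, np.where(x < 0, -1, 0))

# np.select: conditions and choices are listed side by side; the first
# matching condition wins and `default` covers everything else.
selected = np.select([x > 0, x < 0], [1, -1], default=0)

print(nested.tolist())    # [-1, 0, 1]
print(selected.tolist())  # [-1, 0, 1]
```

Both give the same result here; the np.select version tends to stay flat as more conditions are added, whereas each extra np.where adds another level of nesting.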
Upvotes: 1