Reputation: 53
I am a novice user of Pandas. I have a dataframe that looks like this:
days rainfall
1 3.51
2 1.32
3 0
4 0
5 0
6 0
7 0
8 0
9 0.03
10 0
11 0
12 0.17
13 0.23
14 0.02
15 0
16 0
17 0
18 0.03
19 0.02
20 0
21 0
I would like to add a column (let's call it 'cumulative') that shows the cumulative rainfall values for every week. In other words, I want to calculate the cumulative values for the first seven days (1-7), then the second set of seven days (8-14), and so on.
The end product would look like this:
days rainfall cumulative
1 3.51 4.83
2 1.32 0.45
3 0 0.05
4 0
5 0
6 0
7 0
8 0
9 0.03
10 0
11 0
12 0.17
13 0.23
14 0.02
15 0
16 0
17 0
18 0.03
19 0.02
20 0
21 0
So far I've tried calling rolling() with sum(), but I do not get what I want:
df['cumulative'] = df['rainfall'].rolling(min_periods=7, window=7).sum()
Grateful for any tips or advice!
Upvotes: 2
Views: 2430
Reputation: 42916
If I understand you correctly, you want GroupBy.transform:
# create groups of 7 days each with floor division
grps = df['days'].sub(1).floordiv(7)
# get the cumulative sum per group
df['cumsum'] = df.groupby(grps)['rainfall'].transform('sum')
days rainfall cumsum
0 1 3.51 4.83
1 2 1.32 4.83
2 3 0.00 4.83
3 4 0.00 4.83
4 5 0.00 4.83
5 6 0.00 4.83
6 7 0.00 4.83
7 8 0.00 0.45
8 9 0.03 0.45
9 10 0.00 0.45
10 11 0.00 0.45
11 12 0.17 0.45
12 13 0.23 0.45
13 14 0.02 0.45
14 15 0.00 0.05
15 16 0.00 0.05
16 17 0.00 0.05
17 18 0.03 0.05
18 19 0.02 0.05
19 20 0.00 0.05
20 21 0.00 0.05
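The column above repeats each week's total on every row of that week. If by "cumulative" you instead want a running total that resets at each week boundary, the same grouping works with cumsum — a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'days': range(1, 22),
    'rainfall': [3.51, 1.32, 0, 0, 0, 0, 0,
                 0, 0.03, 0, 0, 0.17, 0.23, 0.02,
                 0, 0, 0, 0.03, 0.02, 0, 0],
})

# same week groups as above: 0 for days 1-7, 1 for days 8-14, ...
grps = df['days'].sub(1).floordiv(7)

# running total within each week, resetting at every week boundary
df['running'] = df.groupby(grps)['rainfall'].cumsum()
```

The last row of each week then matches the weekly totals (4.83, 0.45, 0.05).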
Upvotes: 1
Reputation: 808
EDIT: Another method that works without DateTime indices is pd.cut().
df.groupby(pd.cut(df.days, bins=3, precision=0))["rainfall"].sum()
days
(1.0, 8.0] 4.83
(8.0, 14.0] 0.45
(14.0, 21.0] 0.05
The cut method lets you split a column's values into a fixed number of interval bins.
pd.cut(df.days, bins=3)
is a way of saying "take the Series df["days"] and split it into three equal-width chunks". If you run that code alone, you see:
0 (1.0, 8.0]
1 (1.0, 8.0]
2 (1.0, 8.0]
...
19 (14.0, 21.0]
20 (14.0, 21.0]
It's labeling each row in your DataFrame with what bin it belongs in. You can then use that as an argument in a groupby statement, just like any other column attribute, and apply an aggregate function.
Putting ["rainfall"] outside the groupby statement is a way of saying, "this is the column I want the sum of" (i.e., don't sum the days). You could alternatively write it first, as df["rainfall"].groupby(...), if that's more intuitive. (It's great, and also frustrating, that pandas has more than one right way to do things.)
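With bins=3, pandas computes the bin edges itself, which is why they come out as floats. If you want the weeks pinned exactly at days 7, 14, and 21, you can pass explicit edges instead — a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'days': range(1, 22),
    'rainfall': [3.51, 1.32, 0, 0, 0, 0, 0,
                 0, 0.03, 0, 0, 0.17, 0.23, 0.02,
                 0, 0, 0, 0.03, 0.02, 0, 0],
})

# explicit edges give exact week intervals: (0, 7], (7, 14], (14, 21]
weekly = df.groupby(pd.cut(df['days'], bins=[0, 7, 14, 21]))['rainfall'].sum()
```

The resulting Series is indexed by the three intervals, with one weekly total each.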
ORIGINAL ANSWER:
For aggregate statistics, you can use pd.resample(). It's a DateTime index method (I had to coerce it a bit here, but usually you'll have more to go on with weather timestamps).
df.resample("W").sum()["rainfall"]
is the code to downsample days to weeks and aggregate values.
In this case, I constructed a DataFrame from a dictionary and cast the index to DateTime format to use the resample method:
df = pd.DataFrame(
    data={
        "days": list(range(1, 22)),
        "rainfall": [3.51, 1.32, 0, 0, 0, 0, 0,
                     0, 0.03, 0, 0, 0.17, 0.23, 0.02,
                     0, 0, 0, 0.03, 0.02, 0, 0]},
    index=pd.to_datetime(list(range(1, 22)), format="%d",
                         errors="coerce"))
That gets you:
1900-01-07 4.83
1900-01-14 0.45
1900-01-21 0.05
Freq: W-SUN, Name: rainfall, dtype: float64
Again, you'd want to adjust the year and month as appropriate, but the nice thing about resample is that you can easily aggregate by predefined time intervals (week, days, minutes, etc.) and custom spans.
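With real timestamps you can skip the coercion entirely. A minimal sketch assuming daily readings that happen to start on Monday 2021-01-04 (an arbitrary date, chosen so the weeks line up with the question's groups):

```python
import pandas as pd

rainfall = [3.51, 1.32, 0, 0, 0, 0, 0,
            0, 0.03, 0, 0, 0.17, 0.23, 0.02,
            0, 0, 0, 0.03, 0.02, 0, 0]

# one reading per day, indexed by an actual DatetimeIndex
s = pd.Series(rainfall, index=pd.date_range('2021-01-04', periods=21, freq='D'))

# weekly totals; 'W-SUN' closes each week on Sunday
weekly = s.resample('W-SUN').sum()
```

Because the series starts on a Monday, the three Sunday-labeled bins cover days 1-7, 8-14, and 15-21 exactly.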
Upvotes: 0
Reputation: 59701
You can do that like this:
import pandas as pd
df = pd.DataFrame([
    [1, 3.51],
    [2, 1.32],
    [3, 0],
    [4, 0],
    [5, 0],
    [6, 0],
    [7, 0],
    [8, 0],
    [9, 0.03],
    [10, 0],
    [11, 0],
    [12, 0.17],
    [13, 0.23],
    [14, 0.02],
    [15, 0],
    [16, 0],
    [17, 0],
    [18, 0.03],
    [19, 0.02],
    [20, 0],
    [21, 0]], columns=['days', 'rainfall'])
result = df['rainfall'].groupby((df['days'] - 1) // 7).sum().reset_index(drop=True)
print(result)
# 0 4.83
# 1 0.45
# 2 0.05
# Name: rainfall, dtype: float64
Upvotes: 1