SHV_la
SHV_la

Reputation: 987

Extract and sum totals for time period

I have a dataset of rainfall, with rainfall values being associated with a time (minute) and date on which rainfall occurred, if no rainfall occurred, nothing is logged. As such some days with large storms may have 100 readings, while some days will have none or only one or two. Example below:

D   M   Y       DAY     Hr  Min   Tot
2   3   2002    2419    5   49    0.2
2   3   2002    2419    5   53    0.4
2   3   2002    2419    5   57    0.2
2   3   2002    2419    6   8     0.6
3   3   2002    2420    3   11    0.2
3   3   2002    2420    3   13    0.2
3   3   2002    2420    23  3     0.8
3   3   2002    2420    23  4     0.4
3   3   2002    2420    23  29    0.6
4   3   2002    2421    19  7     0.2
4   3   2002    2421    19  8     0.2
4   3   2002    2421    19  9     0.2
4   3   2002    2421    19  10    0.2
('Tot' is rainfall total)

This dataset is about ten years long, and I need to translate this data into 10 minute, 1 hour and 3 hour totals for correlation with another variable. For example, I need to be able to extract the total rainfall that occurred in the ten minutes 3/3/2002 23:00:00-23:10:00, which comes to 1.2, and then also for the hour from 3/3/2002 23:00:00 which comes to 1.8, and so on.

As I understand it, this is not simply a group by and sum; it would be if I were extracting daily totals (which I have done already), but this seems more complicated than that. Are there any ideas out there as to how I might go about completing this task?

Many thanks for any thoughts and suggestions!

Upvotes: 0

Views: 269

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

You can use the following to generate your three dataframes/series

Create a column with datetime dtype using to_datetime

df['date'] = pd.to_datetime(df['Y'].astype(str)+'-'+df['M'].astype(str)+'-'+df['D'].astype(str)+' '+df['Hr'].astype(str)+':'+df['Min'].astype(str)+':00')

resample date with the intervals 10T, 1H and 3H with sum

df_10T = df.resample('10T', on='date')['Tot'].sum()
df_1H = df.resample('1H', on='date')['Tot'].sum()
df_3H = df.resample('3H', on='date')['Tot'].sum()

Display non-zero values of each series for checking.

10 Minutes

df_10T[df_10T.ne(0)]
date
2002-03-02 05:40:00    0.2
2002-03-02 05:50:00    0.6
2002-03-02 06:00:00    0.6
2002-03-03 03:10:00    0.4
2002-03-03 23:00:00    1.2
2002-03-03 23:20:00    0.6
2002-03-04 19:00:00    0.6
2002-03-04 19:10:00    0.2
Name: Tot, dtype: float64

1 hour

df_1H[df_1H.ne(0)]
date
2002-03-02 05:00:00    0.8
2002-03-02 06:00:00    0.6
2002-03-03 03:00:00    0.4
2002-03-03 23:00:00    1.8
2002-03-04 19:00:00    0.8
Name: Tot, dtype: float64

3 hour

df_3H[df_3H.ne(0)]
date
2002-03-02 03:00:00    0.8
2002-03-02 06:00:00    0.6
2002-03-03 03:00:00    0.4
2002-03-03 21:00:00    1.8
2002-03-04 18:00:00    0.8
Name: Tot, dtype: float64

Upvotes: 2

pask
pask

Reputation: 927

First thing to do is to convert your dates to datetime objects. After that you can slice your dataframe with timedeltas.

import datetime
import pandas as pd
import numpy as np

df = pd.read_clipboard()

timestep = datetime.timedelta(hours=3)

dates = [datetime.datetime(Y, M, D, h, m) for Y, M, D, h, m in zip(df.Y, df.M, df.D, df.Hr, df.Min)]

df.index = dates

start = df.index.min()
steps = int(np.ceil((df.index.max() - start) / timestep))

bins = [start + i * timestep for i in range(steps)]
data = [df[bins[i]: bins[i+1]].Tot.sum() for i in range(int(steps-1))]

result = pd.DataFrame({"Total": data}, bins[:-1])

enter image description here

Upvotes: 0

Related Questions