Reputation: 987
I have a dataset of rainfall, with rainfall values being associated with a time (minute) and date on which rainfall occurred, if no rainfall occurred, nothing is logged. As such some days with large storms may have 100 readings, while some days will have none or only one or two. Example below:
D M Y DAY Hr Min Tot
2 3 2002 2419 5 49 0.2
2 3 2002 2419 5 53 0.4
2 3 2002 2419 5 57 0.2
2 3 2002 2419 6 8 0.6
3 3 2002 2420 3 11 0.2
3 3 2002 2420 3 13 0.2
3 3 2002 2420 23 3 0.8
3 3 2002 2420 23 4 0.4
3 3 2002 2420 23 29 0.6
4 3 2002 2421 19 7 0.2
4 3 2002 2421 19 8 0.2
4 3 2002 2421 19 9 0.2
4 3 2002 2421 19 10 0.2
('Tot' is rainfall total)
This dataset is about ten years long, and I need to translate this data into 10 minute, 1 hour and 3 hour totals for correlation with another variable. For example, I need to be able to extract the total rainfall that occurred in the ten minutes 3/3/2002 23:00:00-23:10:00, which comes to 1.2, and then also for the hour from 3/3/2002 23:00:00 which comes to 1.8, and so on.
As I understand it, this is not simply a group by and sum; it would be if I were extracting daily totals (which I have done already), but this seems more complicated than that. Are there any ideas out there as to how I might go about completing this task?
Many thanks for any thoughts and suggestions!
Upvotes: 0
Views: 269
Reputation: 153460
You can use the following to generate your three dataframes/series
to_datetime
df['date'] = pd.to_datetime(df['Y'].astype(str)+'-'+df['M'].astype(str)+'-'+df['D'].astype(str)+' '+df['Hr'].astype(str)+':'+df['Min'].astype(str)+':00')
resample
date with the intervals 10T, 1H and 3H with sum
df_10T = df.resample('10T', on='date')['Tot'].sum()
df_1H = df.resample('1H', on='date')['Tot'].sum()
df_3H = df.resample('3H', on='date')['Tot'].sum()
df_10T[df_10T.ne(0)]
date
2002-03-02 05:40:00 0.2
2002-03-02 05:50:00 0.6
2002-03-02 06:00:00 0.6
2002-03-03 03:10:00 0.4
2002-03-03 23:00:00 1.2
2002-03-03 23:20:00 0.6
2002-03-04 19:00:00 0.6
2002-03-04 19:10:00 0.2
Name: Tot, dtype: float64
df_1H[df_1H.ne(0)]
date
2002-03-02 05:00:00 0.8
2002-03-02 06:00:00 0.6
2002-03-03 03:00:00 0.4
2002-03-03 23:00:00 1.8
2002-03-04 19:00:00 0.8
Name: Tot, dtype: float64
df_3H[df_3H.ne(0)]
date
2002-03-02 03:00:00 0.8
2002-03-02 06:00:00 0.6
2002-03-03 03:00:00 0.4
2002-03-03 21:00:00 1.8
2002-03-04 18:00:00 0.8
Name: Tot, dtype: float64
Upvotes: 2
Reputation: 927
First thing to do is to convert your dates to datetime objects. After that you can slice your dataframe with timedeltas.
import datetime
import pandas as pd
import numpy as np
df = pd.read_clipboard()
timestep = datetime.timedelta(hours=3)
dates = [datetime.datetime(Y, M, D, h, m) for Y, M, D, h, m in zip(df.Y, df.M, df.D, df.Hr, df.Min)]
df.index = dates
start = df.index.min()
steps = int(np.ceil((df.index.max() - start) / timestep))
bins = [start + i * timestep for i in range(steps)]
data = [df[bins[i]: bins[i+1]].Tot.sum() for i in range(int(steps-1))]
result = pd.DataFrame({"Total": data}, bins[:-1])
Upvotes: 0