Rachelanne2251
Rachelanne2251

Reputation: 43

How do I take the mean on either side of a value in a pandas DataFrame?

I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.

                 Time  Rain_Rate
1 2014-04-02 00:00:00       0.50
2 2014-04-02 00:30:00       1.10
3 2014-04-02 01:00:00       0.48
4 2014-04-02 01:30:00       2.30
5 2014-04-02 02:00:00       4.10
6 2014-04-02 02:30:00       5.00
7 2014-04-02 03:00:00       3.20

I want to take 3 hour means centered on hours 00, 03, 06, 09, 12, 15 ,18, and 21. I want the mean to consist of 1.5 hours before 03:00:00 (so 01:30:00) and 1.5 hours after 03:00:00 (04:30:00). The 06:00:00 time would overlap with the 03:00:00 average (they would both use 04:30:00). Is there a way to do this using pandas? I've tried a few things but they haven't worked.

Upvotes: 0

Views: 391

Answers (1)

Tom
Tom

Reputation: 8800

Method 1

I'm going to suggest just change your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before resampling at all:

dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
                     Rain_Rate
Time                          
2014-04-02 00:00:00   0.616588
2014-04-02 00:12:00   0.201390
2014-04-02 00:24:00   0.802754
2014-04-02 00:36:00   0.712743
2014-04-02 00:48:00   0.711766

Averaging by 3 hour chunks initially will be the same as doing 30 minute chunks then doing 3 hour chunks. You just have to tweak a couple things to get the right bins you want. First you can add the bin you will start from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin is from 10:30pm - 1:30am), then resample starting from this point

before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()

output = df.resample('3H', base=22.5, loffset='90min').mean()

The base parameter here means start at the 22.5th hour (10:30), and loffset means push the bin names back by 90 minutes. You get the following output:

                     Rain_Rate
Time                          
2014-04-02 00:00:00   0.555515
2014-04-02 03:00:00   0.546571
2014-04-02 06:00:00   0.439953
2014-04-02 09:00:00   0.460898
2014-04-02 12:00:00   0.506690
2014-04-02 15:00:00   0.605775
2014-04-02 18:00:00   0.448838
2014-04-02 21:00:00   0.387380
2014-04-03 00:00:00   0.604204  #this is the bin at midnight on the following day

You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*

Method 2

Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the averages for entries in the 3 hours surrounding:

resampled = df.resample('30T',).mean() #like your data in the post

centers = [0,3,6,9,12,15,18,21]

mask = np.where(df.index.hour.isin(centers) & (df.index.minute==0), True, False)
df_centers = df.index[mask]

output = []
for center in df_centers:
    cond1 = (df.index >= (center - pd.Timedelta(hours=1.5)))
    cond2 = (df.index <= (center + pd.Timedelta(hours=1.5)))
    output.append(df[cond1 & cond2].values.mean())

Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):

[0.5555146139562004,
 0.5465709237162698,
 0.43995277270996735,
 0.46089800625663596,
 0.5066902552121085,
 0.6057747262752732,
 0.44883794039466535,
 0.3873795731806939]

*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to do so), but the second method I used is explicit about doing so (by using >= and <= in cond1 and cond2). However, these two methods achieve the same result here, presumably b/c of the use of resample at different stages causing data points to be included in different bins. It's hard for me to wrap my around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.

Upvotes: 2

Related Questions