Pandas preserve original columns when calculating mean

Question

My dataset df looks like this. It is a minute based dataset

time, Open, High
2017-01-01 00:00:00, 1.2432, 1.1234
2017-01-01 00:01:00, 1.2432, 1.1234
2017-01-01 00:02:00, 1.2332, 1.1234
2017-01-01 00:03:00, 1.2132, 1.1234
...., ...., ....
2017-12-31 23:59:00, 1.2132, 1.1234

I did the following to calculate the hourly mean from the above minute dataset

df['time'] = pd.to_datetime(df['time'])
df.index = df['time']
df_mean = df.resample('H').mean()

I then load df_mean and I get hourly value:

time,                 Open       High   
2017-01-01 00:00:00 1.051488    1.051500     
2017-01-01 01:00:00 1.051247    1.051275     
2017-01-01 02:00:00 1.051890    1.051957     
2017-01-01 03:00:00 1.051225    1.051290     
...., ...., ....
2017-12-31 23:00:00 1.051225    1.051290

but what I also want is the original Open value and High value

I need help with two things:

Once the mean is calculated, I want it to be labeled Open_Mean and High_Mean
Since the hourly mean gives time in hour basis(eg: 2017-01-01 01:00:00), I want to load the original Open and High values at that time.

Here:

Open and High values are the same for the particular timestamp form the original dataset but Open_Mean and High_Mean is the calculated hourly mean for that timestamp

The new df should look like this:

time,                 Open          High     Open_Mean   High_Mean    
2017-01-01 00:00:00   1.051488    1.051500   1.051500    1.051500 
2017-01-01 01:00:00   1.051247    1.051275   1.051500    1.051500
2017-01-01 02:00:00   1.051890    1.051957   1.051500    1.051500  
2017-01-01 03:00:00   1.051225    1.051290   1.051500    1.051500  
...., ...., ...., ...., ....
2017-12-31 23:00:00   1.051225    1.051290   1.051500    1.051500

Once we get the correct dataset in df I want to filter the new df to load only time specific data.

For eg: load data for time range form 10 PM - 4 PM every day. Currently it load all hours.

Quang Hoang · Accepted Answer

Use add_suffix to rename

df['time'] = pd.to_datetime(df['time'])

df_mean = df.set_index('time').resample('H').mean()

df_mean = df_mean.add_suffix('_Mean')

and merge with how='inner' to extract the data on the hours:

df.merge(df_mean, left_on='time', right_index=True, how='inner')

Output (head of random data);

                    time      Open      High  Open_Mean  High_Mean
0    2017-01-01 00:00:00  1.219690  1.693049   1.519751   2.042550
60   2017-01-01 01:00:00  1.688490  1.404521   1.526833   2.115046
120  2017-01-01 02:00:00  1.015285  2.653544   1.533529   1.797564
180  2017-01-01 03:00:00  1.357672  2.299571   1.506012   2.043484
240  2017-01-01 04:00:00  1.293786  2.312414   1.489759   2.131644
300  2017-01-01 05:00:00  1.040048  2.791968   1.438353   1.816585
360  2017-01-01 06:00:00  1.225080  1.505802   1.473208   2.193237
420  2017-01-01 07:00:00  1.145402  3.217261   1.481710   1.914683

Pandas preserve original columns when calculating mean

Answers (1)

Related Questions