Mr. Engineer

Reputation: 375

Pandas: compute average and standard deviation by clock time

I have a DataFrame like this:

         date      time  value
0  2019-04-18  07:00:10  100.8
1  2019-04-18  07:00:20   95.6
2  2019-04-18  07:00:30   87.6
3  2019-04-18  07:00:40   94.2

The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and the mean of value for each hour of each date, and create two new columns for them. I first tried separating the hour for each value like this:

df["hour"] = df["time"].astype(str).str[:2]

Then I tried to calculate the standard deviation with:

df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()

But that doesn't work. Could I have some advice on the problem?

Upvotes: 2

Views: 1353

Answers (2)

Shubham Sharma

Reputation: 71689

We can split the time column on the delimiter :, take the hour component with str[0], then group the dataframe on date together with the hour component and aggregate column value with mean and std:

hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
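
On the four sample rows above, which all fall into the same (date, hour) group, this should produce something like:

                  mean       std
date       time
2019-04-18 07    94.55  5.434151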

If you want to broadcast the aggregated values back onto the original dataframe, use transform instead of agg:

g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')

         date      time  value   mean       std
0  2019-04-18  07:00:10  100.8  94.55  5.434151
1  2019-04-18  07:00:20   95.6  94.55  5.434151
2  2019-04-18  07:00:30   87.6  94.55  5.434151
3  2019-04-18  07:00:40   94.2  94.55  5.434151
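
Note that g above rebuilds the hour key inline; if the hr series from the first snippet is still in scope, it can simply be reused instead of splitting the time column a second time, for example:

g = df.groupby(['date', hr])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')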

Upvotes: 4

Rob Raymond

Reputation: 31166

  • I have synthesized sample data. Start by generating a true datetime column
  • groupby() the hour
  • use describe() to get the mean & std
  • merge() back onto the original data frame
import pandas as pd
import numpy as np

# synthesize a year of readings at 10-second intervals
d = pd.date_range("1-Jan-2019", "31-Dec-2019", freq="10S")
df = pd.DataFrame({"datetime": d, "value": np.random.uniform(70, 90, len(d))})

df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
               time=df.datetime.dt.strftime("%H:%M:%S"))

# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)

# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
         .apply(lambda dfa: dfa.describe().T.loc[:, ["mean", "std"]].reset_index(drop=True))
         .droplevel(1))

# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
                    datetime      value       mean       std
0        2019-01-01 00:00:00  86.014209  80.043364  5.777724
1        2019-01-01 00:00:10  77.241141  80.043364  5.777724
2        2019-01-01 00:00:20  71.650739  80.043364  5.777724
3        2019-01-01 00:00:30  71.066332  80.043364  5.777724
4        2019-01-01 00:00:40  77.203291  80.043364  5.777724
...                      ...        ...        ...       ...
3144955  2019-12-30 23:59:10  89.577237  80.009751  5.773007
3144956  2019-12-30 23:59:20  82.154883  80.009751  5.773007
3144957  2019-12-30 23:59:30  82.131952  80.009751  5.773007
3144958  2019-12-30 23:59:40  85.346724  80.009751  5.773007
3144959  2019-12-30 23:59:50  78.122761  80.009751  5.773007

Upvotes: 1
