pandas: calculate the average time and standard deviation of unique values of column

Question

assuming I have a list of processes A,B,C,D.... with times

Process time
A        3
A        4
B        5
C        6
B        7
A        4
A        5
B        2
C        3

import pandas as pd
mydf = pd.DataFrame(columns=['process','time'],
 data={'process':list('ABCABCABCDAABB'),'time':[2,3,4,5,6,4,5,6,4,5,6,2,3,4]})

What I want to extract is a Dataframe with only one entry per process per row and two columns, the average time and the standard deviation.

I solved the problem as follows

processes=mydf.process.unique()
dfstats = pd.DataFrame(columns=['P','average','std'])

for process in processes:
    mask = mydf.process == process
    average,std = mydf.loc[mask,['time']]['time'].mean(),mydf.loc[mask,['time']]['time'].std()
    dfstats.loc[len(dfstats)] =[process,average,std]

I have the feeling that there should be a straight way to do this. my solution is too verbose, too long, and probably too slow (I am dealing with a couple of million rows)

any other more pythonic-pandonic solution?

thanks

anon01 · Accepted Answer

You are finding the statistics of each group. This can be done via groupby and agg:

df.groupby("process").agg(["mean", "std"])

output:

        mean       std
process
A        4.0  1.870829
B        4.4  1.516575
C        4.0  0.000000
D        5.0       NaN

pandas: calculate the average time and standard deviation of unique values of column

Answers (2)

Related Questions