calculate statistics and/or a function on multiple columns in subsets of a dataframe in R

Question

I have a dataframe (sample of the following form):

DateTime                Ind1  Ind2  V1  V2  Ac1  Ac2  w1  w2   w3  shift
2016-05-01 00:01:00      U     A    5   7   20   100  50  70   200   1
2016-05-01 00:01:20      U     A    5   7   20   109  35  77   140   1
2016-05-01 00:01:40      U     A    5   7   40   120  55  97   160   1
...
2016-05-01 00:08:20      U     A    5   7   15   157  70  70   204   2
...
2016-05-02 00:08:20      U     A    5   7   28   147  65  90   240   2
...
2016-05-02 00:20:00      U     A    5   7   35   210  45  100  167   3

I need a new dataframe where some statistics (e.g. mean, standard deviation) for the columns v1 to w3 are listed for each date-and-shift combination, something similar to the following:

Date      shift   Ind1  Ind2  avgV1  sdV1   avgV2 sdV2  avgAC1 ....   
2016-05-01   1      U     A    5.3    2.9    7.8   4.5   108  .....
2016-05-01   2      U     A    6.7    3.5    8.9   5.0   99 .....

SOLUTION TRIED:
I can do the following steps.

1) extract date from DateTime

df$Date <- format(as.POSIXct(df$DateTime, format="%Y-%m-%d %H:%M:%S"), format="%Y-%m-%d")

2) label the data by date and shift.

df$DateShift <- paste(df$Date, df$shift)

3) for each subset, calculate some statistics on a col:

tmp_df <- data.frame(levels(as.factor(df$DateShift)))
avgV1 <- tapply(df$V1, df$DateShift, FUN=mean) 
sdV1 <- tapply(df$V1, df$DateShift, FUN=sd)
avgV2<- tapply(df$V2, df$DateShift, FUN=mean) 
....

However, I have more than 50 columns in the original dataframe, with different types of names (not as simple as in the example above).
Moreover, the statistics that I want to compute may vary (say, calculation of max and min, or some other user-defined function).

So I don't want to code by hand for the different combinations of columns and type of statistic (mean, standard dev, etc.)
What is the way to automate this?

Remko Duursma · Accepted Answer

I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.

library(doBy)
df_avg <- summaryBy(. ~ Date + Shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)

Will give a dataframe with V1.mean, V1.median, and so on. The . ~ means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2, for example.

calculate statistics and/or a function on multiple columns in subsets of a dataframe in R

Answers (2)

Related Questions