Reputation: 77
I have a dataframe (sample of the following form):
DateTime Ind1 Ind2 V1 V2 Ac1 Ac2 w1 w2 w3 shift
2016-05-01 00:01:00 U A 5 7 20 100 50 70 200 1
2016-05-01 00:01:20 U A 5 7 20 109 35 77 140 1
2016-05-01 00:01:40 U A 5 7 40 120 55 97 160 1
...
2016-05-01 00:08:20 U A 5 7 15 157 70 70 204 2
...
2016-05-02 00:08:20 U A 5 7 28 147 65 90 240 2
...
2016-05-02 00:20:00 U A 5 7 35 210 45 100 167 3
I need a new dataframe where some statistics (e.g. mean, standard deviation) for the columns v1 to w3 are listed for each date-and-shift combination, something similar to the following:
Date shift Ind1 Ind2 avgV1 sdV1 avgV2 sdV2 avgAC1 ....
2016-05-01 1 U A 5.3 2.9 7.8 4.5 108 .....
2016-05-01 2 U A 6.7 3.5 8.9 5.0 99 .....
SOLUTION TRIED:
I can do the following steps.
1) extract date from DateTime
df$Date <- format(as.POSIXct(df$DateTime, format="%Y-%m-%d %H:%M:%S"), format="%Y-%m-%d")
2) label the data by date and shift.
df$DateShift <- paste(df$Date, df$shift)
3) for each subset, calculate some statistics on a col:
tmp_df <- data.frame(levels(as.factor(df$DateShift)))
avgV1 <- tapply(df$V1, df$DateShift, FUN=mean)
sdV1 <- tapply(df$V1, df$DateShift, FUN=sd)
avgV2<- tapply(df$V2, df$DateShift, FUN=mean)
....
However, I have more than 50 columns in the original dataframe, with different types of names (not as simple as in the example above).
Moreover, the statistics that I want to compute may vary (say, calculation of max and min, or some other user-defined function).
So I don't want to code by hand for the different combinations of columns and type of statistic (mean, standard dev, etc.)
What is the way to automate this?
Upvotes: 1
Views: 207
Reputation: 468
library(dplyr)
df %>%
mutate(Date = as.Date(DateTime)) %>%
group_by(Date, shift) %>%
summarise_each(funs(mean))
Upvotes: 0
Reputation: 2821
I am sure the dplyr
solutions are coming, but the doBy
package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.
library(doBy)
df_avg <- summaryBy(. ~ Date + Shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)
Will give a dataframe with V1.mean
, V1.median
, and so on.
The . ~
means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2
, for example.
Upvotes: 1