Reputation: 1
I have a data frame called subdata with dimensions 10299 x 81. Column 1 is called "Subject" and column 2 is called "Activity". I want to calculate the average of every other column, grouped by "Subject" and "Activity".
Here are the functions I tried; none of them seemed to work at first. Finally I used colwise(mean), which seems to work. I am new to R and have just learned the sapply, lapply, and tapply functions, and it seems the mean function works on columns.
Can anyone help explain what these error and warning messages mean, and whether there is a way to make these functions work?
Use lapply function:
newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]
The error message:
Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject", :
unused argument (by = c("Subject", "Activity"))
Use by function:
newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)
I got warning message:
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
Then I tried ddply from the plyr package:
ddply(subdata, .(Subject, Activity), mean)
I got the same warning message:
Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA
Finally I used colwise(mean), which seems to work:
newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))
Upvotes: 0
Views: 1896
Reputation: 19960
It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.
# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4),
Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))
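Your first attempt uses data.table syntax: .SD and a by argument inside [ are only understood when the object is a data.table. Since subdata is a plain data.frame, the call is dispatched to `[.data.frame`, which has no by argument, hence the "unused argument" error. Unless you convert the data to a data.table, you should abandon this attempt.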
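For completeness, here is a minimal sketch of how that first attempt would look once the data is a data.table. It assumes the data.table package is installed and uses the demo data from above (with a seed added for reproducibility); dt is just an illustrative name.

```r
library(data.table)

# Demo data in the same shape as above (seed added so the numbers are reproducible)
set.seed(42)
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# Convert a copy to a data.table; the original data.frame is left untouched
dt <- as.data.table(subdata)

# .SD holds the non-grouping columns of each group, so lapply(.SD, mean)
# returns one mean per remaining column for every Subject/Activity pair
newdata <- dt[, lapply(.SD, mean), by = c("Subject", "Activity")]
newdata
```

The only change from the code in the question is the as.data.table() conversion; the `[` call itself is unchanged.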
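Your by statement is giving a warning about non-numeric data. This is because by passes each group to the function as a whole data frame, grouping columns included, and mean does not know how to average a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and use a function that works on a data frame, such as colMeans.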
by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))
You will probably want to rbind the output of by and reassign the row names so that they correspond to the groups. For this purpose, however, it may be best to just use something like aggregate and avoid the extra work.
aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
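The grouping columns in that result come back with generic names (Group.1, Group.2). As a sketch of one alternative, aggregate also has a formula interface that keeps the original column names; the example below reuses the demo data (with a seed added) and the name agg is only illustrative. Note that the formula interface drops rows containing NA by default, which may or may not be what you want on real data.

```r
# Demo data in the same shape as above (seed added so the numbers are reproducible)
set.seed(1)
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# "." on the left-hand side stands for every column not used for grouping
agg <- aggregate(. ~ Subject + Activity, data = subdata, FUN = mean)
head(agg)
```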
Your ddply statements are close, but as I suggested above you should use numcolwise so that the summary is applied only to your numeric columns.
library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))
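The warnings in the question come down to one thing that can be checked in plain R: both by and ddply hand each group to the function as a whole data frame, and mean has no method for data frames. A minimal sketch with a small made-up data frame (df and its values are illustrative, not taken from the question):

```r
df <- data.frame(v1 = c(1, 2, 3), v2 = c(10, 20, 30))

# mean() on a whole data frame returns NA and warns
# "argument is not numeric or logical"; the warning is suppressed here
whole <- suppressWarnings(mean(df))
is.na(whole)

# Applying mean() column by column works; this is roughly what
# colwise(mean) arranges for each group
per_column <- sapply(df, mean)
per_column

# colMeans() is the base R shortcut for the same result
colMeans(df)
```

This is why wrapping mean in colwise (or numcolwise) makes the ddply call work, while passing mean directly does not.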
Upvotes: 1