user4275832

Reputation: 1

R function applied on data frame grouped by multiple factors

I have a data frame called subdata with dimensions 10299 x 81. Column 1 is called "Subject" and column 2 is called "Activity". I want to calculate the average of each column, grouped by "Subject" and "Activity".

Here are the functions I tried; none of them worked until I finally used colwise(mean), which seems to work. I am new to R and have just learned the sapply, lapply and tapply functions, and it seems that mean has to be applied column by column.

Can anyone help explain what these error and warning messages mean, and whether there is a way to make these functions work?

Using lapply:

newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]

The error message:

Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject",  : 
unused argument (by = c("Subject", "Activity"))

Using by:

newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)

I got a warning message:

Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
   argument is not numeric or logical: returning NA

Then I tried ddply from the plyr package:

ddply(subdata, .(Subject, Activity), mean)

I got the same warning message:

Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA 0

Finally I used colwise(mean), which seems to work:

newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))

Upvotes: 0

Views: 1896

Answers (1)

cdeterman

Reputation: 19960

It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.

# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4), 
                     Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))

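Your first attempt uses data.table syntax (.SD and by inside the square brackets), but subdata is a plain data frame, and `[.data.frame` has no by argument; that is exactly what the "unused argument" error is telling you. Unless you convert subdata to a data.table, you should abandon this attempt.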
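For what it's worth, here is a minimal sketch of how that first call could run. It assumes the data.table package is installed (the question never loads it), and subdt is just an illustrative name for the converted copy.

```r
# Assumption: the data.table package is installed; the question never loads it.
library(data.table)

set.seed(1)  # only to make the demo data reproducible
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# Convert a copy to a data.table so that .SD and by = are understood
subdt <- as.data.table(subdata)

# .SD holds the non-grouping columns of each group; lapply() takes their means
newdata <- subdt[, lapply(.SD, mean), by = c("Subject", "Activity")]
```

The same line on the original data frame fails because the square brackets dispatch to the data.frame method, which knows nothing about by.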

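Your by statement is giving a warning about non-numeric data. This is because by passes each group's rows to the function as a whole data frame, grouping columns included, and mean does not accept a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and use a function that works column by column, such as colMeans.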

by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))

You will probably want to rbind this output and reassign the row names to correspond to the groups, though. For this purpose it may be best to just use something like aggregate to avoid such extra computation.

aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
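To make the last two suggestions concrete, here is a minimal sketch of both routes on the demo data. The names res, means, groups and agg are only for illustration, and the rbind step assumes every Subject/Activity combination occurs at least once (an empty group comes back as NULL and would be dropped, misaligning the row names).

```r
set.seed(1)  # only to make the demo data reproducible
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# Route 1: by() returns a list-like array with one colMeans vector per group
res <- by(subdata[, -c(1, 2)],
          list(Subject = subdata$Subject, Activity = subdata$Activity),
          colMeans)
means <- do.call(rbind, res)          # stack the per-group vectors into a matrix
groups <- expand.grid(dimnames(res))  # group labels, first factor varying fastest
rownames(means) <- paste(groups$Subject, groups$Activity, sep = ".")

# Route 2: aggregate() with the formula interface does the same in one call
# and keeps the grouping column names instead of Group.1 / Group.2
agg <- aggregate(. ~ Subject + Activity, data = subdata, FUN = mean)
```

Note that the formula interface of aggregate drops rows containing NA by default (na.action = na.omit), which the data-frame call above does not do.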

Your ddply statements are close. Plain mean fails because ddply hands it each group's piece as a whole data frame; you should use numcolwise instead, which applies the function to each numeric column.

library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))

Upvotes: 1
