Reputation: 13
I am using cbind
to find the mean of 3 different columns. However I get different answers for the means when I do:
DFNEW <- aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean)
vs
DFNEW <- aggregate(cbind(X1, X2)~Y, DF, FUN=mean)
The means of X1 and X2 are different when I run command 1 and when I run command 2. X1, X2, and X3 all have different numbers of NA arguments- is that the reason? Part of this may also be that I'm not entirely sure what cbind
is doing in this case.
Upvotes: 0
Views: 975
Reputation: 24613
One can also use split command:
> sapply(split(DF, DF$Y), function(x) sapply(x[1:3], mean, na.rm=T))
A B C
X1 3.625 3.00 4.142857
X2 6.375 6.75 7.166667
X3 2.875 1.50 1.333333
(Using @akrun's data)
Upvotes: 0
Reputation: 3501
The beauty of open source is diversity and array of tools available. Quite a fun. Here is a data.table
alternative for your reference. Use @akrun's data.
library(data.table)
setDT(DF)[, lapply(.SD, mean, na.rm=T), by=Y]
# Y X1 X2 X3
# 1: C 4.142857 7.166667 1.333333
# 2: A 3.625000 6.375000 2.875000
# 3: B 3.000000 6.750000 1.500000
setDT(DF)[, lapply(.SD, mean, na.rm=T), by=Y, .SDcols=1:2]
# Y X1 X2
# 1: C 4.142857 7.166667
# 2: A 3.625000 6.375000
# 3: B 3.000000 6.750000
Upvotes: 2
Reputation: 887741
I guess the reason why you are getting different result is because by default, na.action=na.omit
for the formula
interface. So, the rows with NA
s are omitted and not used for the calculation of mean
. When we use different combinations of columns, different rows could be deleted based on the occurrence of NA
. By specifying na.action=NULL
, the rows will not get deleted and we can remove the NA
values while calculating the mean
by using the argument na.rm=TRUE
in the mean
function.
aggregate(cbind(X1, X2)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
#Y X1 X2
#1 A 3.625000 6.375000
#2 B 3.000000 6.750000
#3 C 4.142857 7.166667
aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
# Y X1 X2 X3
#1 A 3.625000 6.375000 2.875000
#2 B 3.000000 6.750000 1.500000
#3 C 4.142857 7.166667 1.333333
The results we get from the above will be the same below i.e. without using the formula
interface
aggregate(DF[,c("X1", "X2")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
# Y X1 X2
# 1 A 3.625000 6.375000
# 2 B 3.000000 6.750000
# 3 C 4.142857 7.166667
aggregate(DF[,c("X1", "X2", "X3")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
# Y X1 X2 X3
#1 A 3.625000 6.375000 2.875000
#2 B 3.000000 6.750000 1.500000
#3 C 4.142857 7.166667 1.333333
If you want some alternatives, you could use dplyr
library(dplyr)
DF %>%
group_by(Y) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
# Source: local data frame [3 x 4]
# Y X1 X2 X3
#1 A 3.625000 6.375000 2.875000
#2 B 3.000000 6.750000 1.500000
#3 C 4.142857 7.166667 1.333333
DF %>%
group_by(Y) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), X1, X2)
#Source: local data frame [3 x 3]
# Y X1 X2
#1 A 3.625000 6.375000
#2 B 3.000000 6.750000
#3 C 4.142857 7.166667
set.seed(42)
DF <- data.frame(X1=sample(c(NA, 1:5), 25, replace=TRUE),
X2= sample(c(NA, 1:10), 25, replace=TRUE), X3= sample(c(NA,0:5), 25,
replace=TRUE), Y=sample(LETTERS[1:3], 25, replace=TRUE))
Upvotes: 2