Reputation: 719
I can't imagine I'm the first person with this question, but I haven't found a solution yet (here or elsewhere).
I have a few columns, which I want to average in R. The only minimally tricky aspect is that some columns contain NAs.
For example:
Trait Col1 Col2 Col3
DF 23 NA 23
DG 2 2 2
DH NA 9 9
I want to create a Col4 that averages the entries in the first 3 columns, ignoring the NAs. So:
Trait Col1 Col2 Col3 Col4
DF 23 NA 23 23
DG 2 2 2 2
DH NA 9 9 9
Ideally something like this would work:
data$Col4 <- mean(data$Col1, data$Col2, data$Col3, na.rm=TRUE)
but it doesn't.
Upvotes: 28
Views: 96828
Reputation: 1724
Why NOT the accepted answer?
The accepted answer is correct, but it is specific to this particular task and hard to generalize. What if we need, instead of `mean`, other statistics like `var` or `skewness`, or even a custom function?
A more flexible solution:
row_means <- apply(X=data[, -1], MARGIN=1, FUN=mean, na.rm=TRUE)  # drop the non-numeric Trait column first
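As a minimal sketch of that flexibility, the same call can take `var` or a custom function. This assumes the question's table is stored in a data frame named `data`; the helper `row_range` is a hypothetical name used only for illustration. The non-numeric `Trait` column is left out because `apply` coerces its input to a matrix.

```r
# Example data from the question (the object name `data` is an assumption)
data <- data.frame(
  Trait = c("DF", "DG", "DH"),
  Col1  = c(23, 2, NA),
  Col2  = c(NA, 2, 9),
  Col3  = c(23, 2, 9)
)

# Keep only the numeric columns; with Trait included, as.matrix() would
# turn every value into a character string
num <- data[, c("Col1", "Col2", "Col3")]

# Row-wise mean, ignoring NAs; na.rm is forwarded to mean() through `...`
row_means <- apply(num, MARGIN = 1, FUN = mean, na.rm = TRUE)

# Row-wise variance, same pattern with a different FUN
row_vars <- apply(num, MARGIN = 1, FUN = var, na.rm = TRUE)

# A custom function (hypothetical helper): max minus min, ignoring NAs
row_range <- function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
row_spread <- apply(num, MARGIN = 1, FUN = row_range)
```

Here `row_means` holds the values the question asks for in `Col4`; swapping `FUN` is the only change needed for the other statistics.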
More details on `apply`:
Generally, to apply any function (custom or built-in) across an entire dataset, column-wise or row-wise, `apply` or one of its variations (`sapply`, `lapply`, ...) can be used. Its signature is:
apply(X, MARGIN, FUN, ...)
where:
`X`: The data, as a data frame or matrix.
`MARGIN`: The dimension along which the aggregation takes place. Use `1` for row-wise operation and `2` for column-wise operation.
`FUN`: The operation to be called on the data. Any predefined R function or user-defined function can be used.
`...`: Additional arguments passed on to `FUN`. For example, `na.rm = TRUE` is forwarded to `mean`, which then removes the `NA` values before averaging.
Why should I use `apply`?
For many reasons, including but not limited to:
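Any function, built-in or custom, can be passed to `apply`.
Variants exist for other data structures (e.g. `lapply` for operations on lists).
Parallel variants exist (e.g. `mclapply` from the {parallel} package).
Upvotes: 7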
Reputation: 174813
You want `rowMeans()`, but importantly note it has a `na.rm` argument that you want to set to `TRUE`. E.g.:
> mat <- matrix(c(23,2,NA,NA,2,9,23,2,9), ncol = 3)
> mat
[,1] [,2] [,3]
[1,] 23 NA 23
[2,] 2 2 2
[3,] NA 9 9
> rowMeans(mat)
[1] NA 2 NA
> rowMeans(mat, na.rm = TRUE)
[1] 23 2 9
To match your example:
> dat <- data.frame(Trait = c("DF","DG","DH"), mat)
> names(dat) <- c("Trait", paste0("Col", 1:3))
> dat
Trait Col1 Col2 Col3
1 DF 23 NA 23
2 DG 2 2 2
3 DH NA 9 9
> dat <- transform(dat, Col4 = rowMeans(dat[,-1], na.rm = TRUE))
> dat
Trait Col1 Col2 Col3 Col4
1 DF 23 NA 23 23
2 DG 2 2 2 2
3 DH NA 9 9 9
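If the numeric columns are not simply "everything except the first", selecting them by name is a small variation on the same idea. The sketch below rebuilds the example data and adds a hypothetical fourth row, `DI`, which is not in the question, to show one edge case: a row whose values are all `NA` comes back as `NaN` when `na.rm = TRUE`.

```r
# Example data, plus a hypothetical all-NA row (DI) for illustration
dat <- data.frame(
  Trait = c("DF", "DG", "DH", "DI"),
  Col1  = c(23, 2, NA, NA),
  Col2  = c(NA, 2, 9, NA),
  Col3  = c(23, 2, 9, NA)
)

# Select the columns to average by name rather than by position
cols <- c("Col1", "Col2", "Col3")
dat$Col4 <- rowMeans(dat[, cols], na.rm = TRUE)

# Rows with no non-missing values yield NaN; flag them if that matters
all_missing <- is.nan(dat$Col4)
```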
Upvotes: 35