Reputation: 609
I'm trying to summarize (take the mean) of various columns based on a single column within a data.table.
Here's a toy example of my data and the code I used that shows the problem I'm having:
library(data.table)
a<- data.table(
a=c(1213.1,NA,113.41,133.4,121.1,45.34),
b=c(14.131,NA,1.122,113.11,45.123,344.3),
c=c(101.2,NA,232.1,194.3,12.12,7645.3),
d=c(11.32,NA,32.121,94.3213,1223.1,34.1),
e=c(1311.32,NA,12.781,13.2,2.1,623.2),
f=c("A", "B", "B", "A", "B", "X"))
a
setkey(a,f) # column "f" is what I want to summarize columns by
a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)] # I just want to summarize first 4 columns
The output of the last line:
> a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)]
f a b c d
1: A 673.25 63.6205 147.75 52.82065
2: B NA NA NA NA
3: X 45.34 344.3000 7645.30 34.10000
Why are B entries NA
? Shouldn't NA
be ignored in the calculation of the mean? I think I found a similar issue here, but perhaps this is different and/or I've got the syntax messed up.
If this isn't possible in data.table
, I'm open to other suggestions.
Upvotes: 2
Views: 2806
Reputation: 8717
In R, the default behavior of the mean()
function is to output NA
if there are missing values. To ignore NA
s in the mean calculation, you need to set the argument na.rm=TRUE
. lapply
takes in additional arguments to the function it is passed, so for your problem, you can try
a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]
Upvotes: 3