Getting NA when summarizing by columns in data.table

Question

I'm trying to summarize (take the mean) of various columns based on a single column within a data.table.

Here's a toy example of my data and the code I used that shows the problem I'm having:

library(data.table)
a<- data.table(
  a=c(1213.1,NA,113.41,133.4,121.1,45.34),
  b=c(14.131,NA,1.122,113.11,45.123,344.3),
  c=c(101.2,NA,232.1,194.3,12.12,7645.3),
  d=c(11.32,NA,32.121,94.3213,1223.1,34.1),
  e=c(1311.32,NA,12.781,13.2,2.1,623.2),
  f=c("A", "B", "B", "A", "B", "X"))
a
setkey(a,f) # column "f" is what I want to summarize columns by

a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)] # I just want to summarize first 4 columns

The output of the last line:

> a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)]
 f a b c d
1: A 673.25 63.6205 147.75 52.82065
2: B NA NA NA NA
3: X 45.34 344.3000 7645.30 34.10000

Why are B entries NA? Shouldn't NA be ignored in the calculation of the mean? I think I found a similar issue here, but perhaps this is different and/or I've got the syntax messed up.

If this isn't possible in data.table, I'm open to other suggestions.

ialm · Accepted Answer

In R, the default behavior of the mean() function is to output NA if there are missing values. To ignore NAs in the mean calculation, you need to set the argument na.rm=TRUE. lapply takes in additional arguments to the function it is passed, so for your problem, you can try

a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]

Getting NA when summarizing by columns in data.table

Answers (1)

Related Questions