coding_heart
coding_heart

Reputation: 1295

data.table aggregation having trouble with NAs

I'm using the data.table package to aggregate a panel with multiple observations in each time-period for a given location into a panel in which there is a unique observation for a time-period-location. However, I'm having trouble aggregating for observations with an NA. While this works well:

set.seed(123)
data.frame(name = c("a", "a", "a", "b", "b", "b"), t = rep(c(1,2),3), x1 = sample(0:10,6), x2 =  sample(0:10,6))
f1

  name t x1 x2
    a 1  3  5
    a 2  7  8
    a 1 10  4
    b 2  9  3
    b 1  6  6
    b 2  0  2

f2 <- setDT(f1)[, lapply(.SD, sum(x, na.rm = TRUE)), by = .(name,t), .SDcols = c("x1", "x2")]
f2

   name t x1 x2
    a 1 13  9
    a 2  7  8
    b 2  9  5
    b 1  6  6

Adding an NA prevents sum() from working:

f1[1,3] <- NA
f1
   name t x1 x2
    a 1 NA  5
    a 2  7  8
    a 1 10  4
    b 2  9  3
    b 1  6  6
    b 2  0  2

f2 <- setDT(f1)[, lapply(.SD, sum(x, na.rm = TRUE)), by = .(name,t), .SDcols = c("x1", "x2")]
f2
name t x1 x2
    a 1 NA  9
    a 2  7  8
    b 2  9  5
    b 1  6  6

Whereas the x1 value for a, 1 should be 10. Any idea why this is happening even with the na.rm argument inside? Thanks.

Upvotes: 2

Views: 268

Answers (1)

Jaap
Jaap

Reputation: 83275

This is an issue of placing the brackets in the right spot. You should use lapply(.SD, sum, na.rm = TRUE) instead of lapply(.SD, sum(x, na.rm = TRUE)). When using lapply, you have to add the extra parameters of the sum-function after the call for sum and not inside the call for sum. Furtherore, you don't need the (x)-part in sum(x). Because the structure of lapply is lapply(X, FUN, ...), the function specified in the FUN-part is automatically applied to the variable in the X-part.

So, your code should be:

setDT(f1)[, lapply(.SD, sum, na.rm = TRUE), by = .(name,t)]

which gives:

   name t x1 x2
1:    a 1  0 11
2:    a 2  5  9
3:    b 2 18 12
4:    b 1  1  4

Note: I left the .SDcols = c("x1", "x2") part out of the code as it is not needed in this case. Including it will give you the same result.

Upvotes: 5

Related Questions