Reputation: 1295
I'm using the data.table
package to aggregate a panel with multiple observations in each time-period for a given location into a panel in which there is a unique observation for a time-period-location. However, I'm having trouble aggregating for observations with an NA. While this works well:
set.seed(123)
data.frame(name = c("a", "a", "a", "b", "b", "b"), t = rep(c(1,2),3), x1 = sample(0:10,6), x2 = sample(0:10,6))
f1
name t x1 x2
a 1 3 5
a 2 7 8
a 1 10 4
b 2 9 3
b 1 6 6
b 2 0 2
f2 <- setDT(f1)[, lapply(.SD, sum(x, na.rm = TRUE)), by = .(name,t), .SDcols = c("x1", "x2")]
f2
name t x1 x2
a 1 13 9
a 2 7 8
b 2 9 5
b 1 6 6
Adding an NA prevents sum()
from working:
f1[1,3] <- NA
f1
name t x1 x2
a 1 NA 5
a 2 7 8
a 1 10 4
b 2 9 3
b 1 6 6
b 2 0 2
f2 <- setDT(f1)[, lapply(.SD, sum(x, na.rm = TRUE)), by = .(name,t), .SDcols = c("x1", "x2")]
f2
name t x1 x2
a 1 NA 9
a 2 7 8
b 2 9 5
b 1 6 6
Whereas the x1 value for a, 1 should be 10. Any idea why this is happening even with the na.rm argument inside? Thanks.
Upvotes: 2
Views: 268
Reputation: 83275
This is an issue of placing the brackets in the right spot. You should use lapply(.SD, sum, na.rm = TRUE)
instead of lapply(.SD, sum(x, na.rm = TRUE))
. When using lapply
, you have to add the extra parameters of the sum
-function after the call for sum
and not inside the call for sum
. Furtherore, you don't need the (x)
-part in sum(x)
. Because the structure of lapply
is lapply(X, FUN, ...)
, the function specified in the FUN
-part is automatically applied to the variable in the X
-part.
So, your code should be:
setDT(f1)[, lapply(.SD, sum, na.rm = TRUE), by = .(name,t)]
which gives:
name t x1 x2
1: a 1 0 11
2: a 2 5 9
3: b 2 18 12
4: b 1 1 4
Note: I left the .SDcols = c("x1", "x2")
part out of the code as it is not needed in this case. Including it will give you the same result.
Upvotes: 5