Reputation: 2677
If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is
setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]
which is quite complex. It's a shame since roll
is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf
within each group)
I can write a convenience function to fill in missing values
fill_na <- function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE)){
id <- seq_along(x)
if (is.null(by)){
DT <- data.table("x" = x, "id" = id, key = "id")
return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
} else{
DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id"))
return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
}
}
And then write
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]
This is not really satisfying since one would like to write
setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]
However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na
should be called with the by
option, and should not be used with data.table
by
. Is there an elegant solution around this?
Some speed test
N <- 2e6
set.seed(1)
DT <- data.table(
date = sample(10, N, TRUE),
id = sample(1e5, N, TRUE),
value = sample(c(NA,1:5), N, TRUE),
value2 = sample(c(NA,1:5), N, TRUE)
)
setkey(DT,id,date)
DT<- unique(DT)
system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user system elapsed
#> 0.086 0.006 0.105
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user system elapsed
#> 5.235 0.016 5.274
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#> user system elapsed
#> 0.194 0.019 0.221
system.time(DT[, filled3 := fill_na(value), by = id])
#> user system elapsed
#> 237.256 0.913 238.405
Why don't I just use na.locf.default
? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.
Upvotes: 22
Views: 11727
Reputation: 1438
There is now a native data.table
way of filling missing values (as of 1.12.4
).
This question spawned a github issue which was recently closed with the creation of functions nafill
and setnafill
. You can now use
DT[, value_filled_in := nafill(value, type = "locf")]
It is also possible to fill NA
with a constant value or next observation carried back.
One difference to the approach in the question is that these functions currently only work on NA
not NaN
whereas is.na
is TRUE
for NaN
- this is planned to be fixed in the next release through an extra argument.
I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.
Update: By default NaN
is now treated same as NA
.
Upvotes: 17
Reputation: 49448
Here's a slightly faster and more compact way of doing it (version 1.9.3+):
DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]
Upvotes: 14