Reputation: 16277
I have a date vector with leading NAs, and I'd like to generate an approximate sequence for these NAs using na.approx from package zoo. na.approx does not work for leading NAs:
x <- as.Date(c(rep(NA,3),"1992-01-16","1992-04-16","1992-07-16",
"1992-10-16","1993-01-15","1993-04-16","1993-07-17"))
as.Date(na.approx(x,na.rm=FALSE))
[1] NA NA NA "1992-01-16" "1992-04-16"
"1992-07-16" "1992-10-16" "1993-01-15" "1993-04-16" "1993-07-17"
I thought that I could reverse my vector using rev, but I still get NAs:
as.Date(na.approx(rev(x),na.rm=FALSE))
[1] "1993-07-17" "1993-04-16" "1993-01-15" "1992-10-16" "1992-07-16"
"1992-04-16" "1992-01-16" NA NA NA
Any ideas?
Upvotes: 3
Views: 286
Reputation: 16277
Found my answer: na.spline does a good job with lots of data. In the example above, I have only a few dates, which causes the extrapolated values to drift. In my real-life data, however, there is no drift.
as.Date(na.spline(rev(x),na.rm=FALSE))
[1] "1993-07-17" "1993-04-16" "1993-01-15" "1992-10-16" "1992-07-16"
"1992-04-16" "1992-01-16" "1991-10-15" "1991-07-13" "1991-04-06"
Upvotes: 2
Reputation: 25854
na.approx needs a rule argument to fill positions that lie outside the range of the non-missing values in your vector; by default they are left as NA. With rule=2, those missing values are imputed with the nearest non-missing value:
as.Date(na.approx(x,na.rm=FALSE, rule=2))
# [1] "1992-01-16" "1992-01-16" "1992-01-16" "1992-01-16" "1992-04-16" "1992-07-16" "1992-10-16" "1993-01-15"
# [9] "1993-04-16" "1993-07-17"
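To see what rule is doing, here is a minimal sketch using base R's approx directly, which, as far as I know, is the function na.approx passes rule on to. The names known_pos, known_val, r1 and r2 are only for illustration, and the dates are converted to plain day counts to keep the example free of zoo.

```r
# Known values sit at positions 4..10; positions 1..3 are the leading gaps.
known_pos <- 4:10
known_val <- as.numeric(as.Date(c("1992-01-16", "1992-04-16", "1992-07-16",
                                  "1992-10-16", "1993-01-15", "1993-04-16",
                                  "1993-07-17")))

# rule = 1 (the default): positions outside 4..10 come back as NA
r1 <- approx(known_pos, known_val, xout = 1:10, rule = 1)$y

# rule = 2: positions outside 4..10 take the value at the nearest end
r2 <- approx(known_pos, known_val, xout = 1:10, rule = 2)$y

is.na(r1[1:3])
# [1] TRUE TRUE TRUE
as.Date(r2[1:3], origin = "1970-01-01")
# [1] "1992-01-16" "1992-01-16" "1992-01-16"
```

So rule=2 only repeats the first known date; it does not extend the sequence backwards.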
As an alternative you can use na.spline (as in your answer). You mention it can get a bit wild, so you could instead write a function that imputes the values based on the time difference between your measures. I use the first non-missing difference here:
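add_leading_seq_dates <- function(x) {
  # position of the first non-missing date
  first_non_missing <- which.min(is.na(x))
  # first available gap between two consecutive dates
  first_day_diff <- na.omit(diff(x))[1]
  no_of_leading_missing <- first_non_missing - 1
  # step backwards from the first known date, one gap at a time
  input_dates <- x[first_non_missing] - cumsum(rep(first_day_diff, no_of_leading_missing))
  # fill only the leading NAs, in chronological order
  x[seq_len(no_of_leading_missing)] <- rev(input_dates)
  x
}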
add_leading_seq_dates(x)
# [1] "1991-04-18" "1991-07-18" "1991-10-17" "1992-01-16" "1992-04-16"
# [6] "1992-07-16" "1992-10-16" "1993-01-15" "1993-04-16" "1993-07-17"
diff(add_leading_seq_dates(x))
# Time differences in days
# [1] 91 91 91 91 91 92 91 91 92
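If you would rather have the imputed dates fall on the same day of the month than be exactly 91 days apart, a variation is to step back by calendar months with seq. This is only a sketch and assumes the series is meant to be quarterly, which I am guessing from the example; n_leading and back are just illustrative names.

```r
x <- as.Date(c(rep(NA, 3), "1992-01-16", "1992-04-16", "1992-07-16",
               "1992-10-16", "1993-01-15", "1993-04-16", "1993-07-17"))

first_non_missing <- which.min(is.na(x))
n_leading <- first_non_missing - 1

# Walk backwards from the first known date in 3-month steps
# (assumed quarterly spacing); the first element is the known date itself.
back <- seq(x[first_non_missing], by = "-3 months", length.out = n_leading + 1)

# Drop the known date, then put the rest in chronological order
x[seq_len(n_leading)] <- rev(back[-1])
x[1:4]
# [1] "1991-04-16" "1991-07-16" "1991-10-16" "1992-01-16"
```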
Upvotes: 2