Reputation: 279
A dummy dataset is :
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group
, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.
I am using dplyr
. which.closest
from birk but it needs a vector and a value. How to look up within a vector without writing loops. Even if there is an SQL
solution, will do.
Any pointers to the solution?
May be something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
Not sure how to specify Thisdate
.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Upvotes: 0
Views: 791
Reputation: 79288
You can try the use of sapply
to find the values closest since the x
argument in `which.closest only takes a single value.
first create a vect
whereby the dates with no values are replaced with NA
and use it within the which.closest
function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
if which.closest
was to take a vector then there would be no need of sapply
. But this is not the case.
Using the dplyr
package:
library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
value=value[sapply(dates,which.closest,vect)])%>%
select(-vect)
Upvotes: 0
Reputation: 269754
Using knn1
(nearest neighbor) from the class package (which comes with R -- don't need to install it) and dplyr define an na.knn1
function which replaces each NA value in x
with the non-NA x
value having the closest time
.
library(class)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by
if the intention was to do this by group.
Upvotes: 2