Kaur
Kaur

Reputation: 279

find closest match within a vector to fill missing values using dplyr

A dummy dataset is :

data <- data.frame(
  group = c(1,1,1,1,1,2),
  dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
            "2005-03-01","2010-02-01")),
  value = c(10,20,NA,40,NA,5)
)

For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.

I am using dplyr. which.closest from birk but it needs a vector and a value. How to look up within a vector without writing loops. Even if there is an SQL solution, will do. Any pointers to the solution?

May be something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))] Not sure how to specify Thisdate.

Edit: The expected value vector should look like:

value = c(10,20,20,40,10,5)

Upvotes: 0

Views: 791

Answers (2)

Onyambu
Onyambu

Reputation: 79288

You can try the use of sapply to find the values closest since the x argument in `which.closest only takes a single value.

first create a vect whereby the dates with no values are replaced with NA and use it within the which.closest function.

library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
  group      dates value
1     1 2005-01-01    10
2     1 2006-05-01    20
3     1 2007-05-01    20
4     1 2004-08-01    40
5     1 2005-03-01    10
6     2 2010-02-01     5

if which.closest was to take a vector then there would be no need of sapply. But this is not the case.
Using the dplyr package:

library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
              value=value[sapply(dates,which.closest,vect)])%>%
     select(-vect)

Upvotes: 0

G. Grothendieck
G. Grothendieck

Reputation: 269754

Using knn1 (nearest neighbor) from the class package (which comes with R -- don't need to install it) and dplyr define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.

library(class)

na.knn1 <- function(x, time) {
   is_na <- is.na(x)
   if (sum(is_na) == 0 || all(is_na)) return(x)
   train <- matrix(time[!is_na])
   test <- matrix(time[is_na])
   cl <- x[!is_na]
   x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
   x
}

data %>% mutate(value = na.knn1(value, dates))

giving:

  group      dates value
1     1 2005-01-01    10
2     1 2006-05-01    20
3     1 2007-05-01    20
4     1 2004-08-01    40
5     1 2005-03-01    10
6     2 2010-02-01     5

Add an appropriate group_by if the intention was to do this by group.

Upvotes: 2

Related Questions