BiXiC
BiXiC

Reputation: 973

Replace NA with values from previous date

I have dated data frame like this one with approximately 1 million rows

  id       date   variable
1  1 2015-01-01         NA
2  1 2015-01-02 -1.1874087
3  1 2015-01-03 -0.5936396
4  1 2015-01-04 -0.6131957
5  1 2015-01-05  1.0291688
6  1 2015-01-06 -1.5810152

Reproducible example is here:

#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)), 
     date = rep(seq.Date(from = as.Date('2015-01-01'), 
             to = as.Date('2015-01-10'), by = 1),3),
     variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA

What I want to do is replace NA values in variable with values from previous date for each id. I created loop which works but very slow (You can find it below). Can you please advice fast alternative for this task. Thank you!

library(dplyr)

#create new variable
Df$variableNew <- Df$variable
#create row numbers vector
Df$n <- 1:dim(Df)[1]
#order data frame by date
Df <- arrange(Df, date)


for (id in levels(Df$id)){
    I <- Df$n[Df$id == id] # create vector of rows for specific id

    for (row in 1:length(I)){ #if variable == NA for the first date change it to mean value
        if (is.na(Df$variableNew[I[1]])) {
            Df$variableNew[I[row]] <- mean(Df$variable,na.rm = T)
        }
        if (is.na(Df$variableNew[I[row]])){ # if variable == NA fassign to this date value from previous date
            Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
        }
    }
}

Upvotes: 0

Views: 1706

Answers (2)

jlhoward
jlhoward

Reputation: 59395

This data.table solution should be extremely fast.

library(zoo)         # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
#     id       date     variable
#  1:  1 2015-01-01 -0.288720759
#  2:  1 2015-01-02 -0.005344028
#  3:  1 2015-01-03  0.707310667
#  4:  1 2015-01-04  1.034107735
#  5:  1 2015-01-05  0.223480415
#  6:  1 2015-01-06 -0.878707613
#  7:  1 2015-01-07 -0.878707613
#  8:  1 2015-01-08 -2.000164945
#  9:  1 2015-01-09 -0.544790740
# 10:  1 2015-01-10 -0.255670709
# ...

So this replaces all embedded NA using locf by id, and then makes a second pass replacing any leading NA with the average of variable for that id. Note that if you do this is the reverse order you may get a different answer.

Upvotes: 3

jeremycg
jeremycg

Reputation: 24945

If you get the dev version of tidyr(0.3.0) available on github, there is a function fill which will do this exactly:

#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>% 
       fill(variable)

It will not do the first value - We can do that with a mutate and replace:

Df %>% group_by(id) %>%
       mutate(variable = ifelse(is.na(variable) & row_number()==1, 
                                replace(variable, 1, mean(variable, na.rm = TRUE)),
                                variable)) %>% 
       fill(variable)

Upvotes: 1

Related Questions