Reputation: 973
I have dated data frame like this one with approximately 1 million rows
id date variable
1 1 2015-01-01 NA
2 1 2015-01-02 -1.1874087
3 1 2015-01-03 -0.5936396
4 1 2015-01-04 -0.6131957
5 1 2015-01-05 1.0291688
6 1 2015-01-06 -1.5810152
Reproducible example is here:
#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)),
date = rep(seq.Date(from = as.Date('2015-01-01'),
to = as.Date('2015-01-10'), by = 1),3),
variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA
What I want to do is replace NA values in variable
with values from previous date for each id
. I created loop which works but very slow (You can find it below). Can you please advice fast alternative for this task. Thank you!
library(dplyr)
#create new variable
Df$variableNew <- Df$variable
#create row numbers vector
Df$n <- 1:dim(Df)[1]
#order data frame by date
Df <- arrange(Df, date)
for (id in levels(Df$id)){
I <- Df$n[Df$id == id] # create vector of rows for specific id
for (row in 1:length(I)){ #if variable == NA for the first date change it to mean value
if (is.na(Df$variableNew[I[1]])) {
Df$variableNew[I[row]] <- mean(Df$variable,na.rm = T)
}
if (is.na(Df$variableNew[I[row]])){ # if variable == NA fassign to this date value from previous date
Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
}
}
}
Upvotes: 0
Views: 1706
Reputation: 59395
This data.table solution should be extremely fast.
library(zoo) # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
# id date variable
# 1: 1 2015-01-01 -0.288720759
# 2: 1 2015-01-02 -0.005344028
# 3: 1 2015-01-03 0.707310667
# 4: 1 2015-01-04 1.034107735
# 5: 1 2015-01-05 0.223480415
# 6: 1 2015-01-06 -0.878707613
# 7: 1 2015-01-07 -0.878707613
# 8: 1 2015-01-08 -2.000164945
# 9: 1 2015-01-09 -0.544790740
# 10: 1 2015-01-10 -0.255670709
# ...
So this replaces all embedded NA
using locf by id
, and then makes a second pass replacing any leading NA
with the average of variable
for that id
. Note that if you do this is the reverse order you may get a different answer.
Upvotes: 3
Reputation: 24945
If you get the dev version of tidyr
(0.3.0) available on github, there is a function fill
which will do this exactly:
#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>%
fill(variable)
It will not do the first value - We can do that with a mutate and replace:
Df %>% group_by(id) %>%
mutate(variable = ifelse(is.na(variable) & row_number()==1,
replace(variable, 1, mean(variable, na.rm = TRUE)),
variable)) %>%
fill(variable)
Upvotes: 1