Reputation: 733
I have a data frame (df) with missing values and want to impute interpolated values with restriction. My data frame is:
X<-c(100,NA,NA,70,NA,NA,NA,NA,NA,NA,35)
Y<-c(10,NA,NA,40,NA,NA,NA,NA,NA,NA,5)
Z<-c(50,NA,NA,20,NA,NA,NA,NA,NA,NA,90)
df<-as.data.frame(cbind(X,Y,Z))
df
X Y Z
1 100 10 50
2 NA NA NA
3 NA NA NA
4 70 40 20
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 35 5 90
I was able to impute missing values from linear interpolation of the known values using:
data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
X Y Z
1 100 10 50
2 90 20 40
3 80 30 30
4 70 40 20
5 65 35 30
6 60 30 40
7 55 25 50
8 50 20 60
9 45 15 70
10 40 10 80
11 35 5 90
My question is how can I put constraint to the interpolation? Say NAs more than 5 consecutive entries should remain as NAs and not be imputed by linear interpolation so that my new data frame would look like:
X Y Z
1 100 10 50
2 90 20 40
3 80 30 30
4 70 40 20
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 35 5 90
Upvotes: 1
Views: 256
Reputation: 8146
Here's something that works. It uses na.rm
to identify NA
s, rle
to identify runs of NA
s, and then cumsum
to translate those runs into positions in the vector.
data.frame(lapply(df, function(X) {
af = approxfun(seq_along(X), X)
rl = rle(is.na(X))
cu = cumsum(rl$length)
L=5
unlist(sapply(1:length(cu), function(x) {
if (rl$values[x] & rl$length[x]>L) rep(NA, rl$lengths[x])
else af(seq(cu[x]-rl$lengths[x]+1,cu[x]))
}))
}))
Upvotes: 3