Reputation: 954
I would like to impute values for NA observations at the beginning of the array, using a linear approximation of the following two non-NA observations to extrapolate the missing value. Then do the same for the NA observations at the end of the array, using the preceding two non-NA observations.
A reproducible example of my df:
M=matrix(sample(1:9,10*10,T),10);M[sample(1:length(M),0.5*length(M),F)]=NA;dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4 5 NA 3 NA NA NA 5 NA
City2 6 NA 3 3 NA 4 6 NA NA 7
City3 NA 7 NA 8 8 NA NA 8 NA 5
City4 3 5 3 NA NA 3 5 9 8 7
City5 4 6 6 NA NA 8 NA 7 1 NA
City6 NA NA NA NA 4 NA 8 3 6 7
City7 9 3 NA NA NA NA NA 4 NA NA
City8 5 6 9 8 5 NA NA 1 4 NA
City9 NA NA 6 NA 3 3 8 NA 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
idx=rowSums(!is.na(M))>=2 # Index of rows with 2 or more non-NA to run na.approx
library(zoo)
M[idx,]=t(na.approx(t(M[idx,]),rule=1,method="linear")) # I'm using t as na.approx works on columns
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 NA NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
I would like to extrapolate the boundaries (for City1
and City9
) using a linear approximation based on the two preceding/following observations. For example M[1,1]
should be 3
and M[1,10]
should be 5,5
.
Do you know how I could do this?
Upvotes: 3
Views: 288
Reputation: 269526
In extrap
, nlead
is the number of leading NAs in input vector x
. non.na
is the subset of elements of x
which are not NA. Return the input if there are no leading NA elements or if there are fewer than 2 non-NA elements. m
is the slope of the first two non-NAs. Replace the first nlead
elements of x
with the extrapolation. Finally we apply extrap
to each row of M
using MM[] <-
so column names are preserved and then reverse each row, repeat and reverse back:
library(zoo)
extrap <- function(x) {
nlead <- which.min(x * 0) - 1
non.na <- na.omit(x)
if (length(nlead) == 0 || nlead == 0) || length(non.na) < 2) return(x)
m <- diff(head(non.na, 2))
replace(x, seq_len(nlead), non.na[1] - nlead:1 * m)
}
nc <- ncol(M)
naApprox <- function(x) if (length(na.omit(x)) < 2) x else na.approx(x, na.rm = FALSE)
MM <- M
MM[] <- t(apply(MM, 1, naApprox))
MM[] <- t(apply(MM, 1, extrap)) # extraploate to fill leading NAs
MM[] <- t(apply(MM[, nc:1], 1, extrap))[, nc:1] # extrapolate to fill trailing NAs
giving:
> MM
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5.000000 4.000000 3.000000 3.500000 4.000000 4.500000 5.000000 5.500000
City2 6.0 4.5 3.000000 3.000000 3.500000 4.000000 6.000000 6.333333 6.666667 7.000000
City3 6.5 7.0 7.500000 8.000000 8.000000 8.000000 8.000000 8.000000 6.500000 5.000000
City4 3.0 5.0 3.000000 3.000000 3.000000 3.000000 5.000000 9.000000 8.000000 7.000000
City5 4.0 6.0 6.000000 6.666667 7.333333 8.000000 7.500000 7.000000 1.000000 -5.000000
City6 -4.0 -2.0 0.000000 2.000000 4.000000 6.000000 8.000000 3.000000 6.000000 7.000000
City7 9.0 3.0 3.166667 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667 4.333333
City8 5.0 6.0 9.000000 8.000000 5.000000 3.666667 2.333333 1.000000 4.000000 7.000000
City9 9.0 7.5 6.000000 4.500000 3.000000 3.000000 8.000000 7.500000 7.000000 6.500000
City10 NA NA NA NA NA NA NA NA NA 1.000000
Note We used this as M
:
M <- structure(c(NA, 6L, NA, 3L, 4L, NA, 9L, 5L, NA, NA, 4L, NA, 7L,
5L, 6L, NA, 3L, 6L, NA, NA, 5L, 3L, NA, 3L, 6L, NA, NA, 9L, 6L,
NA, NA, 3L, 8L, NA, NA, NA, NA, 8L, NA, NA, 3L, NA, 8L, NA, NA,
4L, NA, 5L, 3L, NA, NA, 4L, NA, 3L, 8L, NA, NA, NA, 3L, NA, NA,
6L, NA, 5L, NA, 8L, NA, NA, 8L, NA, NA, NA, 8L, 9L, 7L, 3L, 4L,
1L, NA, NA, 5L, NA, NA, 8L, 1L, 6L, NA, 4L, 7L, NA, NA, 7L, 5L,
7L, NA, 7L, NA, NA, NA, 1L), .Dim = c(10L, 10L), .Dimnames = list(
c("City1", "City2", "City3", "City4", "City5", "City6", "City7",
"City8", "City9", "City10"), c("Year1", "Year2", "Year3",
"Year4", "Year5", "Year6", "Year7", "Year8", "Year9", "Year10"
)))
Update: Fixed.
Upvotes: 2
Reputation: 23014
This gives you the first column with the linearly-extrapolated values filled in for NA
. You can adapt for the last column.
firstNAfill <- function(x) {
ans <- ifelse(!is.na(x[1]),
x[1],
ifelse(sum(!is.na(x))<2, NA,
2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]]
)
)
return(ans)
}
dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])}))
Result:
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 7.5 NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
The function returns the first column's current value if not NA
, NA
if there aren't two values to extrapolate from, and the extrapolated value otherwise.
Upvotes: 1