goclem
goclem

Reputation: 954

Imputation for bounding NA observations using a linear approximation

I would like to impute values for NA observations at the beginning of the array, using a linear approximation of the following two non-NA observations to extrapolate the missing value. Then do the same for the NA observations at the end of the array, using the preceding two non-NA observations.

A reproducible example of my df:

M=matrix(sample(1:9,10*10,T),10);M[sample(1:length(M),0.5*length(M),F)]=NA;dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
    M

       Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1     NA     4     5    NA     3    NA    NA    NA     5     NA
City2      6    NA     3     3    NA     4     6    NA    NA      7
City3     NA     7    NA     8     8    NA    NA     8    NA      5
City4      3     5     3    NA    NA     3     5     9     8      7
City5      4     6     6    NA    NA     8    NA     7     1     NA
City6     NA    NA    NA    NA     4    NA     8     3     6      7
City7      9     3    NA    NA    NA    NA    NA     4    NA     NA
City8      5     6     9     8     5    NA    NA     1     4     NA
City9     NA    NA     6    NA     3     3     8    NA     7     NA
City10    NA    NA    NA    NA    NA    NA    NA    NA    NA      1

idx=rowSums(!is.na(M))>=2 # Index of rows with 2 or more non-NA to run na.approx

library(zoo)
M[idx,]=t(na.approx(t(M[idx,]),rule=1,method="linear")) # I'm using t as na.approx works on columns

       Year1 Year2 Year3 Year4    Year5 Year6 Year7 Year8 Year9 Year10
City1     NA   4.0     5   4.0 3.000000  3.50   4.0   4.5     5     NA
City2    6.0   5.5     3   3.0 5.500000  4.00   6.0   6.0     6      7
City3    4.5   7.0     3   8.0 8.000000  3.50   5.5   8.0     7      5
City4    3.0   5.0     3   8.0 6.666667  3.00   5.0   9.0     8      7
City5    4.0   6.0     6   8.0 5.333333  8.00   6.5   7.0     1      7
City6    6.5   4.5     7   8.0 4.000000  6.75   8.0   3.0     6      7
City7    9.0   3.0     8   8.0 4.500000  5.50   8.0   4.0     5     NA
City8    5.0   6.0     9   8.0 5.000000  4.25   8.0   1.0     4     NA
City9     NA    NA     6   4.5 3.000000  3.00   8.0   7.5     7     NA
City10    NA    NA    NA    NA       NA    NA    NA    NA    NA      1

I would like to extrapolate the boundaries (for City1 and City9) using a linear approximation based on the two preceding/following observations. For example M[1,1] should be 3 and M[1,10] should be 5,5.

Do you know how I could do this?

Upvotes: 3

Views: 288

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269526

In extrap, nlead is the number of leading NAs in input vector x. non.na is the subset of elements of x which are not NA. Return the input if there are no leading NA elements or if there are fewer than 2 non-NA elements. m is the slope of the first two non-NAs. Replace the first nlead elements of x with the extrapolation. Finally we apply extrap to each row of M using MM[] <- so column names are preserved and then reverse each row, repeat and reverse back:

library(zoo)

extrap <- function(x) {
    nlead <- which.min(x * 0) - 1
    non.na <- na.omit(x)
    if (length(nlead) == 0 || nlead == 0) || length(non.na) < 2) return(x)
    m <- diff(head(non.na, 2))      
    replace(x, seq_len(nlead), non.na[1] - nlead:1 * m)
}

nc <- ncol(M)

naApprox <- function(x) if (length(na.omit(x)) < 2) x else na.approx(x, na.rm = FALSE)
MM <- M
MM[] <- t(apply(MM, 1, naApprox))

MM[] <- t(apply(MM, 1, extrap)) # extraploate to fill leading NAs
MM[] <- t(apply(MM[, nc:1], 1, extrap))[, nc:1] # extrapolate to fill trailing NAs

giving:

> MM
       Year1 Year2    Year3    Year4    Year5    Year6    Year7    Year8    Year9    Year10
City1    3.0   4.0 5.000000 4.000000 3.000000 3.500000 4.000000 4.500000 5.000000  5.500000
City2    6.0   4.5 3.000000 3.000000 3.500000 4.000000 6.000000 6.333333 6.666667  7.000000
City3    6.5   7.0 7.500000 8.000000 8.000000 8.000000 8.000000 8.000000 6.500000  5.000000
City4    3.0   5.0 3.000000 3.000000 3.000000 3.000000 5.000000 9.000000 8.000000  7.000000
City5    4.0   6.0 6.000000 6.666667 7.333333 8.000000 7.500000 7.000000 1.000000 -5.000000
City6   -4.0  -2.0 0.000000 2.000000 4.000000 6.000000 8.000000 3.000000 6.000000  7.000000
City7    9.0   3.0 3.166667 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667  4.333333
City8    5.0   6.0 9.000000 8.000000 5.000000 3.666667 2.333333 1.000000 4.000000  7.000000
City9    9.0   7.5 6.000000 4.500000 3.000000 3.000000 8.000000 7.500000 7.000000  6.500000
City10    NA    NA       NA       NA       NA       NA       NA       NA       NA  1.000000

Note We used this as M:

M <- structure(c(NA, 6L, NA, 3L, 4L, NA, 9L, 5L, NA, NA, 4L, NA, 7L, 
5L, 6L, NA, 3L, 6L, NA, NA, 5L, 3L, NA, 3L, 6L, NA, NA, 9L, 6L, 
NA, NA, 3L, 8L, NA, NA, NA, NA, 8L, NA, NA, 3L, NA, 8L, NA, NA, 
4L, NA, 5L, 3L, NA, NA, 4L, NA, 3L, 8L, NA, NA, NA, 3L, NA, NA, 
6L, NA, 5L, NA, 8L, NA, NA, 8L, NA, NA, NA, 8L, 9L, 7L, 3L, 4L, 
1L, NA, NA, 5L, NA, NA, 8L, 1L, 6L, NA, 4L, 7L, NA, NA, 7L, 5L, 
7L, NA, 7L, NA, NA, NA, 1L), .Dim = c(10L, 10L), .Dimnames = list(
    c("City1", "City2", "City3", "City4", "City5", "City6", "City7", 
    "City8", "City9", "City10"), c("Year1", "Year2", "Year3", 
    "Year4", "Year5", "Year6", "Year7", "Year8", "Year9", "Year10"
    )))

Update: Fixed.

Upvotes: 2

Sam Firke
Sam Firke

Reputation: 23014

This gives you the first column with the linearly-extrapolated values filled in for NA. You can adapt for the last column.

firstNAfill <- function(x) {
  ans <- ifelse(!is.na(x[1]),
                x[1],
                ifelse(sum(!is.na(x))<2, NA,
                       2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]]
                )
  )
  return(ans)
}


dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])}))

Result:

       Year1 Year2 Year3 Year4    Year5 Year6 Year7 Year8 Year9 Year10
City1    3.0   4.0     5   4.0 3.000000  3.50   4.0   4.5     5     NA
City2    6.0   5.5     3   3.0 5.500000  4.00   6.0   6.0     6      7
City3    4.5   7.0     3   8.0 8.000000  3.50   5.5   8.0     7      5
City4    3.0   5.0     3   8.0 6.666667  3.00   5.0   9.0     8      7
City5    4.0   6.0     6   8.0 5.333333  8.00   6.5   7.0     1      7
City6    6.5   4.5     7   8.0 4.000000  6.75   8.0   3.0     6      7
City7    9.0   3.0     8   8.0 4.500000  5.50   8.0   4.0     5     NA
City8    5.0   6.0     9   8.0 5.000000  4.25   8.0   1.0     4     NA
City9    7.5    NA     6   4.5 3.000000  3.00   8.0   7.5     7     NA
City10    NA    NA    NA    NA       NA    NA    NA    NA    NA      1

The function returns the first column's current value if not NA, NA if there aren't two values to extrapolate from, and the extrapolated value otherwise.

Upvotes: 1

Related Questions