lrdbs
lrdbs

Reputation: 33

R linear extrapolate missing values

Is there a simple way to linearly extrapolate missing values in an R data frame?

Maybe this is a trivial and often encountered problem in data preprocessing, however, after searching for quite a while, I could not find any straightforward solution.

This question is NOT about interpolation but rather about extrapolating missing data before and after occurrences.

Using user defined functions, this problem is solvable I know, but I am afraid that would be unnecessary in this case.

Below is the starting input data frame, including the desired output format.

Any help/hint is highly appreciated. Thank you very much in advance.

Input data frame:

input <- read.table(header=TRUE, text="
  ID1 ID2 ID3 ID4 ID5 ID6
  NA  20  NA  NA  NA  NA
  21  21  NA  NA  22  NA
  22  22  23  24  23  22
  NA  23  24  25  NA  23
  NA  24  25  26  NA  24
  NA  25  26  27  NA  25
  NA  26  27  28  NA  26
  NA  NA  28  NA  NA  27
  NA  NA  NA  NA  NA  NA
  NA  NA  NA  NA  NA  NA
")

Output data frame:

output <- read.table(header=TRUE, text="
  ID1 ID2 ID3 ID4 ID5 ID6
  20  20  21  22  21  20
  21  21  22  23  22  21
  22  22  23  24  23  22
  23  23  24  25  24  23
  24  24  25  26  25  24
  25  25  26  27  26  25
  26  26  27  28  27  26
  27  27  28  29  28  27
  28  28  29  30  29  28
  29  29  30  31  30  29
")

Upvotes: 3

Views: 2663

Answers (1)

Joris C.
Joris C.

Reputation: 6234

A possible approach is to use Hmisc::approxExtrap, base R's approx function does not support linear extrapolation only linear interpolation:

input[] <- sapply(input, function(y) {
      xs <- seq_along(y)
      Hmisc::approxExtrap(x = xs[!is.na(y)], y = y[!is.na(y)], xout = xs)$y
    })

input
#>    ID1 ID2 ID3 ID4 ID5 ID6
#> 1   20  20  21  22  21  20
#> 2   21  21  22  23  22  21
#> 3   22  22  23  24  23  22
#> 4   23  23  24  25  24  23
#> 5   24  24  25  26  25  24
#> 6   25  25  26  27  26  25
#> 7   26  26  27  28  27  26
#> 8   27  27  28  29  28  27
#> 9   28  28  29  30  29  28
#> 10  29  29  30  31  30  29

Note that we could always calculate the (least-squares) estimated coefficients of a linear model by hand:

input[] <- sapply(input, function(y) {

      xs <- seq_along(y)[!is.na(y)]
      ys <- y[!is.na(y)]

      b.hat <- cov(xs, ys) / var(xs)
      a.hat <- mean(ys) - b.hat * mean(xs)

      a.hat + b.hat * seq_along(y)

    })

input
#>    ID1 ID2 ID3 ID4 ID5 ID6
#> 1   20  20  21  22  21  20
#> 2   21  21  22  23  22  21
#> 3   22  22  23  24  23  22
#> 4   23  23  24  25  24  23
#> 5   24  24  25  26  25  24
#> 6   25  25  26  27  26  25
#> 7   26  26  27  28  27  26
#> 8   27  27  28  29  28  27
#> 9   28  28  29  30  29  28
#> 10  29  29  30  31  30  29

Upvotes: 7

Related Questions