Reputation: 33
Is there a simple way to linearly extrapolate missing values in an R data frame?
Maybe this is a trivial and often encountered problem in data preprocessing, however, after searching for quite a while, I could not find any straightforward solution.
This question is NOT about interpolation but rather about extrapolating missing data before and after occurrences.
Using user defined functions, this problem is solvable I know, but I am afraid that would be unnecessary in this case.
Below is the starting input data frame, including the desired output format.
Any help/hint is highly appreciated. Thank you very much in advance.
Input data frame:
input <- read.table(header=TRUE, text="
ID1 ID2 ID3 ID4 ID5 ID6
NA 20 NA NA NA NA
21 21 NA NA 22 NA
22 22 23 24 23 22
NA 23 24 25 NA 23
NA 24 25 26 NA 24
NA 25 26 27 NA 25
NA 26 27 28 NA 26
NA NA 28 NA NA 27
NA NA NA NA NA NA
NA NA NA NA NA NA
")
Output data frame:
output <- read.table(header=TRUE, text="
ID1 ID2 ID3 ID4 ID5 ID6
20 20 21 22 21 20
21 21 22 23 22 21
22 22 23 24 23 22
23 23 24 25 24 23
24 24 25 26 25 24
25 25 26 27 26 25
26 26 27 28 27 26
27 27 28 29 28 27
28 28 29 30 29 28
29 29 30 31 30 29
")
Upvotes: 3
Views: 2663
Reputation: 6234
A possible approach is to use Hmisc::approxExtrap
, base R's approx
function does not support linear extrapolation only linear interpolation:
input[] <- sapply(input, function(y) {
xs <- seq_along(y)
Hmisc::approxExtrap(x = xs[!is.na(y)], y = y[!is.na(y)], xout = xs)$y
})
input
#> ID1 ID2 ID3 ID4 ID5 ID6
#> 1 20 20 21 22 21 20
#> 2 21 21 22 23 22 21
#> 3 22 22 23 24 23 22
#> 4 23 23 24 25 24 23
#> 5 24 24 25 26 25 24
#> 6 25 25 26 27 26 25
#> 7 26 26 27 28 27 26
#> 8 27 27 28 29 28 27
#> 9 28 28 29 30 29 28
#> 10 29 29 30 31 30 29
Note that we could always calculate the (least-squares) estimated coefficients of a linear model by hand:
input[] <- sapply(input, function(y) {
xs <- seq_along(y)[!is.na(y)]
ys <- y[!is.na(y)]
b.hat <- cov(xs, ys) / var(xs)
a.hat <- mean(ys) - b.hat * mean(xs)
a.hat + b.hat * seq_along(y)
})
input
#> ID1 ID2 ID3 ID4 ID5 ID6
#> 1 20 20 21 22 21 20
#> 2 21 21 22 23 22 21
#> 3 22 22 23 24 23 22
#> 4 23 23 24 25 24 23
#> 5 24 24 25 26 25 24
#> 6 25 25 26 27 26 25
#> 7 26 26 27 28 27 26
#> 8 27 27 28 29 28 27
#> 9 28 28 29 30 29 28
#> 10 29 29 30 31 30 29
Upvotes: 7