Reputation: 281
I want to interpolate missing values using dplyr, piping, and spline()
.
Data:
test <- structure(list(site = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("lake", "stream", "wetland"
), class = "factor"), depth = c(0L, -3L, -4L, -8L, -10L, -14L,
0L, -1L, -3L, -5L, 0L, -2L, -4L, -6L), var1 = c(NA, 1L, 3L, NA,
6L, NA, 1L, 2L, NA, 4L, 1L, NA, NA, 4L), var2 = c(1L, NA, 3L,
4L, 8L, NA, NA, NA, NA, NA, NA, 2L, NA, NA)), .Names = c("site",
"depth", "var1", "var2"), class = "data.frame", row.names = c(NA,
-14L))
Q1: How do I use the following functioning code, but limit the range of interpolation to occur between the first non-NA
value and the last non-NA
value for each variable. For example, it should only interpolate var1
for wetland
at depth -8
and return NA
for depths 0
and -14
.
library(tidyverse)
test_int <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs("i" = if(sum(!is.na(.)) > 1)
spline(x=depth, y=., xout=depth)[["y"]]
else
NA))
Q2: Is there a way to bound my interpolated values from 0
to Inf
? Or is this not appropriate with spline (e.g., I should use another interpolation method such as smooth
or loess
)?
Upvotes: 3
Views: 571
Reputation: 14988
Not pretty, but capable of filtering out the excess values.
Side effect is that it filters out interpolated values beyond the min
and max
limits as well.
test_clean <-
test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs(c("c" = if(sum(!is.na(.)) > 1)
spline(x=depth, y=., xout=depth)[["y"]]
else NA),
"min" = min(., na.rm = TRUE),
"max" = max(., na.rm = TRUE)
)
) %>%
mutate(var1_i = if_else(var1_c >= var1_min & var1_c <= var1_max, var1_c, NA_real_),
var2_i = if_else(var2_c >= var2_min & var2_c <= var2_max, var2_c, NA_real_)) %>%
select(site:var2, ends_with("i"))
test_clean
# A tibble: 14 x 6
# Groups: site [3]
site depth var1 var2 var1_i var2_i
<fctr> <int> <int> <int> <dbl> <dbl>
1 wetland 0 NA 1 NA 1.000000
2 wetland -3 1 NA 1.0 3.078125
3 wetland -4 3 3 3.0 3.000000
4 wetland -8 NA 4 NA 4.000000
5 wetland -10 6 8 6.0 8.000000
6 wetland -14 NA NA NA NA
7 lake 0 1 NA 1.0 NA
8 lake -1 2 NA 2.0 NA
9 lake -3 NA NA 3.4 NA
10 lake -5 4 NA 4.0 NA
11 stream 0 1 NA 1.0 NA
12 stream -2 NA 2 2.0 NA
13 stream -4 NA NA 3.0 NA
14 stream -6 4 NA 4.0 NA
and to help everyone working on improving this or proofing the steps that took place on the way to the final dataframe, here's the dataframe with the intermediate steps included:
# A tibble: 14 x 12
# Groups: site [3]
site depth var1 var2 var1_c var2_c var1_min var2_min var1_max var2_max var1_i var2_i
<fctr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 wetland 0 NA 1 -7.5714286 1.000000 1 1 6 8 NA 1.000000
2 wetland -3 1 NA 1.0000000 3.078125 1 1 6 8 1.0 3.078125
3 wetland -4 3 3 3.0000000 3.000000 1 1 6 8 3.0 3.000000
4 wetland -8 NA 4 6.7142857 4.000000 1 1 6 8 NA 4.000000
5 wetland -10 6 8 6.0000000 8.000000 1 1 6 8 6.0 8.000000
6 wetland -14 NA NA -0.5714286 30.750000 1 1 6 8 NA NA
7 lake 0 1 NA 1.0000000 NA 1 Inf 4 -Inf 1.0 NA
8 lake -1 2 NA 2.0000000 NA 1 Inf 4 -Inf 2.0 NA
9 lake -3 NA NA 3.4000000 NA 1 Inf 4 -Inf 3.4 NA
10 lake -5 4 NA 4.0000000 NA 1 Inf 4 -Inf 4.0 NA
11 stream 0 1 NA 1.0000000 NA 1 2 4 2 1.0 NA
12 stream -2 NA 2 2.0000000 NA 1 2 4 2 2.0 NA
13 stream -4 NA NA 3.0000000 NA 1 2 4 2 3.0 NA
14 stream -6 4 NA 4.0000000 NA 1 2 4 2 4.0 NA
Upvotes: 1