Reputation: 33
I have a large dataset with a lot of missing values and I want to impute it by group "name" either linearly or with moving average.
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
> d
name year V W X Y Z
1 a 1 NA 11 11 11 NA
2 a 2 21 NA 21 21 NA
3 a 3 31 31 NA 31 31
4 a 4 41 41 41 NA 41
5 b 1 11 11 NA NA 11
6 b 2 NA 21 21 21 NA
7 b 3 NA NA NA 31 31
8 b 4 41 NA 41 NA NA
9 c 1 NA NA 11 NA NA
10 c 2 NA NA NA 21 NA
11 c 3 NA 31 NA NA NA
12 c 4 41 NA NA NA NA
Hopefully the results can be as closed as the following:
name year V W X Y Z
1 a 1 11 11 11 11 11
2 a 2 21 21 21 21 21
3 a 3 31 31 31 31 31
4 a 4 41 41 41 41 41
5 b 1 11 11 11 11 11
6 b 2 21 21 21 21 21
7 b 3 31 31 31 31 31
8 b 4 41 41 41 41 41
9 c 1 11 11 11 11 NA
10 c 2 21 21 21 21 NA
11 c 3 31 31 31 31 NA
12 c 4 41 41 41 41 NA
I found this and this. Tried the following without groupby but it didn't work:
data.frame(lapply(d, function(X) approxfun(seq_along(X), X)(seq_along(X))))
imputeTS::na_ma(d, k = 2, weighting = "simple")
The first one gave an error as below:
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
In addition: Warning message:
In xy.coords(x, y, setLab = FALSE) :
Error in approxfun(seq_along(X), X) :
need at least two non-NA values to interpolate
So I tried the second one and it keep loading for a long time and nothing happened. According to the reply from the first link,
the package requires time series/vector input (that's why each column has to be called separately).
Any help is greatly appreciated!
Upvotes: 0
Views: 889
Reputation: 7730
One issue I see is, that some of the series you want to impute have only 1 non-NA value, thus na_ma
or na_interpolation
from imputeTS or also other packages can not be applied successfully, since these require at least 2 non-NA values.
That is why in this solution I created a impute_select
function for you, that let's you choose, what to to when > 1
values or present, when exactly == 1
values are present or when there are only NAs.
In this case, when > 1 values is present, it uses na_ma
, but you could also use na_interpoltion or any other imputation function from imputeTS here.
When only 1 value is present, it uses na_locf
since this method also works with only 1 value in the series.
When no non-NA values are in the series, it uses na_replace, just replacing all the NAs with a default value (I just set it to 11)
By adjusting this function you should be able to individually adjust the imputation for different amounts of NAs in the series.
library("imputeTS")
d <- data.frame(
name = c('a', 'a','a','a','b','b','b','b','c','c','c','c'),
year = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
V = c(NA, 21, 31, 41, 11, NA, NA, 41, NA, NA, NA, 41),
W = c(11, NA, 31, 41, 11, 21, NA, NA, NA, NA, 31, NA),
X = c(11, 21, NA, 41, NA, 21, NA, 41, 11, NA, NA, NA),
Y = c(11, 21, 31, NA, NA, 21, 31, NA, NA, 21, NA, NA),
Z = c(NA, NA, 31, 41, 11, NA, 31, NA, NA, NA, NA, NA)
)
impute_select <- function(x) {
# select a method to use when more than 1 values are available
if (sum(!is.na(x)) > 1) {
result <- na_ma(x)
}
# Select value when only 1 value is in series
if (sum(!is.na(x)) == 1) {
result <- na_locf(x)
}
# Select method, when no non-NA value is present
else {
result <- na_replace(x, 11)
}
}
# This code is to apply the function row-wise to your data frame
# Since usually the imputation would happen column-wise instead
d[,3:7] <- t(apply(d[,3:7], MARGIN =1, FUN = impute_select))
d
This are the results (hopefully exactly what you wanted):
name year V W X Y Z 1 a 1 11 11 11 11 11 2 a 2 21 11 21 21 11 3 a 3 31 31 11 31 31 4 a 4 41 41 41 11 41 5 b 1 11 11 11 11 11 6 b 2 11 21 21 21 11 7 b 3 11 11 11 31 31 8 b 4 41 11 41 11 11 9 c 1 11 11 11 11 11 10 c 2 21 21 21 21 21 11 c 3 31 31 31 31 31 12 c 4 41 41 41 41 41
Upvotes: 1
Reputation: 388962
You can use zoo::na.spline
-
library(dplyr)
d %>%
group_by(name) %>%
mutate(across(V:Z, zoo::na.spline, na.rm = FALSE)) %>%
ungroup
# name year V W X Y Z
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 11 11 11 11 11
# 2 a 2 21 21 21 21 21
# 3 a 3 31 31 31 31 31
# 4 a 4 41 41 41 41 41
# 5 b 1 11 11 11 11 11
# 6 b 2 21 21 21 21 21
# 7 b 3 31 31 31 31 31
# 8 b 4 41 41 41 41 41
# 9 c 1 41 31 11 21 NA
#10 c 2 41 31 11 21 NA
#11 c 3 41 31 11 21 NA
#12 c 4 41 31 11 21 NA
For name
, "c"
I think it would be difficult to impute the missing values only from 1 number.
Upvotes: 3