Max Wfhde

Imputing based on percentage of NA values

I want to impute temperature values from 6 different weather stations. The data are measured every 30 minutes. I only want to impute values for days and months that have less than 20 % NA values. So I group the values per date/month, calculate the share of NAs per date/month, and then want to filter to keep the days/months with less than 20 % NA so that I can impute those. What is the best way to do that? I am having trouble coding the filter, because I am not sure it filters the way I want. Also, what is the best method to impute the missing values afterwards? I tried to familiarize myself with the imputeTS package, but I am not sure which method I should use: na_seadec, na_seasplit, or something else?

My data (a sample of 20 rows, created with slice_sample(n = 20) from the dplyr package):

df <- structure(list(td = structure(c(1591601400, 1586611800, 1574420400, 
1583326800, 1568898000, 1561969800, 1577010600, 1598238000, 1593968400, 
1567800000, 1590967800, 1584981000, 1563597000, 1589117400, 1599796800, 
1563467400, 1569819600, 1571014800, 1573320600, 1577154600), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), Temp_Dede = c(13.7, NA, NA, 6.4, 14.9, 19.1, 1.3, 
14.2, 21.1, 15.1, 10, 5, 14.1, 24.2, 8.8, 25.3, 14.9, 19.7, NA, 
6.2), Temp_188 = c(13.1, 12.6, 8.9, 6.3, 14.5, 18.8, 1.4, 14.2, 
20.9, 13.1, 10.4, 5.1, 12.2, 24.2, 9.4, 25.9, 14.8, 18.9, NA, 
6.1), Temp_275 = c(13.9, 12.6, 8.8, 6, 14.3, 18.9, 1.4, 13.5, 
20.4, 12.2, 11.1, 4.6, 12.5, 23.3, 9.9, 24, 14.8, 19.2, 6.9, 
5.9), Temp_807 = c(13.9, 13.1, 8.8, 6.2, 14.3, 19.1, 1.4, 14.7, 
20.5, 13.3, 10.6, 4.9, 12.8, 23.1, 10.3, 24.8, 14.7, 19.1, 6.9, 
6.1), Temp_1189 = c(13.7, 12.3, 8.8, 5.6, 14.1, 18.4, 1.4, 13.3, 
19.9, 13.3, 10.7, 4.4, 13.6, 24, 9.8, 24.9, 14.7, 19.1, 6.9, 
5.7), Temp_1599 = c(13.2, 12.7, 8.8, 5.1, 14.3, 18.3, 1.8, 14.2, 
20.3, 13.2, 10.6, 4.4, 12.1, 22.9, 9.8, 25.8, 14.8, 19.2, 6.9, 
5.9)), row.names = c(NA, -20L), class = "data.frame")

This is the code I've been using so far. In this first step I am only grouping by days. Some months have several complete days missing, so after that I also need to filter out months with more than 20 % NAs.

library(dplyr)
df %>% mutate(Datum = as.Date(td)) %>% group_by(Datum) %>%
  # keep only days on which every station has less than 20 % NA
  filter(if_all(starts_with("Temp_"), ~ mean(is.na(.x)) < 0.2))
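To check which days the filter should keep, I think the NA share per day can also be computed explicitly first. This is only a minimal sketch on made-up data: the data frame `toy`, its single station column `Temp_A`, and the names `na_share` and `keep_days` are hypothetical and not part of my real data. It assumes a complete day has 48 half-hourly rows.

```r
library(dplyr)

# Toy data (made up for illustration): 2 days x 48 half-hourly readings, one station.
# Day 1 has 24 of 48 values missing, day 2 has 1 of 48 missing.
toy <- data.frame(
  td = seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
           by = "30 min", length.out = 96),
  Temp_A = c(rep(NA, 24), rep(5, 24), NA, rep(6, 47))
)

# Share of missing values per day and station
na_share <- toy %>%
  mutate(Datum = as.Date(td)) %>%
  group_by(Datum) %>%
  summarise(across(starts_with("Temp_"), ~ mean(is.na(.x))), .groups = "drop")

# Days on which every station has less than 20 % NA
keep_days <- na_share %>%
  filter(if_all(starts_with("Temp_"), ~ .x < 0.2)) %>%
  pull(Datum)
```

Here `na_share` has one row per day, so the 20 % threshold can be inspected before anything is dropped. Presumably the same pattern would work per month by grouping on something like `format(td, "%Y-%m")` instead of `as.Date(td)`, but I have not verified that on the full data. Note that rows which are missing entirely (rather than present with NA) would not be counted this way.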

I am not sure what to do next and I am stuck.
