Reputation: 33
I have a large panel data with 1000s of rows. I want to use group by (gvkey) and impute values for NAs but some groups have all NAs. I want to ignore those groups.
These lines of code give me what I seek
set.seed(123)
fake_data <- data.frame(
gvkey = rep(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), each = num_years),
year = rep(2010:2014, 10),
dltt = rnorm(50))
for (gvkey in c("A", "B", "D", "E", "F", "G", "H", "I", "J")) {
year_to_replace <- sample(c(2011, 2012, 2013), size = sample(2:3, 1), replace = FALSE)
fake_data$dltt[fake_data$gvkey == gvkey & fake_data$year %in% year_to_replace] <- NA
}
fake_data <- fake_data %>%
arrange(gvkey, year) %>%
group_by(gvkey) %>%
mutate(dltt_imputed = na.approx(dltt))
But I get an error if some group has all NAs
fake_data$dltt[fake_data$gvkey == "C"] <- NA
fake_data <- fake_data %>%
arrange(gvkey, year) %>%
group_by(gvkey) %>%
mutate(dltt_imputed = na.approx(dltt))
Please would someone help me add some conditions to the ongoing pipe to ignore such groups
Upvotes: 0
Views: 59
Reputation: 2140
One option is to provide a condition for mutate()
to ignore groups with complete missing values (here, group C)
, and approximate missing values using corresponding non-missing values of related group. Since the num-years
is not provided in your question, I assume it as num_years=5
based on the total number of values (50)
.
library(zoo)
library(tidyverse)
num_years <- 5
fake_data$dltt[fake_data$gvkey == "C"] <- NA
fake_data <- fake_data %>%
arrange(gvkey, year) %>%
group_by(gvkey) %>%
mutate(dltt_imputed = ifelse(sum(is.na(dltt)) == num_years | !(is.na(dltt)), dltt, na.approx(dltt)))
Note that the new imputed column would contain group C
and doesn't exclude such groups with complete missing values. So I would leave this to QO how to proceed later with complete missing groups in the new imputed column.
Upvotes: 0