I have a data.frame with variables that are indexed by group and year, like so:
library(tidyverse)
set.seed(8675309)
df <- data.frame(
year = rep(1991:2000, 10),
groups = rep(1:10, each = 10),
var1 = rnorm(100),
var2 = rnorm(100)
)
head(df)
year groups var1 var2
1 1991 1 -0.9965824 0.74453768
2 1992 1 0.7218241 -1.34662801
3 1993 1 -0.6172088 0.33014251
4 1994 1 2.0293916 -0.01272533
5 1995 1 1.0654161 -0.46367596
6 1996 1 0.9872197 0.20494209
where some of the observations are missing for a specific year, say, 1996:
df[df$year == 1996, ]$var1 <- ifelse(df[df$year == 1996, ]$var1 > 0,
NA, df[df$year == 1996, ]$var1)
## If 1996 is missing in var1, it is missing in all vars:
df$var2 <- ifelse(is.na(df$var1), NA, df$var2)
My question is, how can I replace the values of var1 and var2 conditional on whether or not they already exist? This is the gist of what I want:
df %>%
group_by(groups) %>%
mutate_all(funs(replace_1996_if_NA_with_value_from_1994))
Since it's unclear how you'd like to replace missing values, I replace them using mean imputation (taking the mean of the column and using that to replace the value).
# Some of the observations are now missing
n <- 10
df[cbind(sample(1:nrow(df), n, replace=T), sample(1:ncol(df), n, replace=T))] <- NA
We extract the rows containing NAs:
df[rowSums(is.na(df)) > 0,]
# year groups var1 var2
# 5 1995 1 NA -0.4636760
# 14 1994 2 NA 1.1556394
# 34 1994 NA 0.58852729 -0.7053416
# 37 1997 4 0.06391704 NA
# 47 1997 NA -0.87493144 1.1691501
# 50 2000 5 0.03609091 NA
# 54 1994 NA -2.13523626 -1.0991012
# 80 2000 8 -1.35752606 NA
# 84 NA 9 0.02038586 -1.6054171
# 92 1992 NA 0.59155773 -1.768570
Replace with column means using dplyr's mutate() and across() (mutate_each() and funs() are deprecated in current dplyr):
newDF <- mutate(df, across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Updated columns:
newDF[rowSums(is.na(df)) > 0,]
# year groups var1 var2
# 5 1995.000 1.00000 0.04923291 -0.46367596
# 14 1994.000 2.00000 0.04923291 1.15563940
# 34 1994.000 5.46875 0.58852729 -0.70534164
# 37 1997.000 4.00000 0.06391704 -0.04406217
# 47 1997.000 5.46875 -0.87493144 1.16915008
# 50 2000.000 5.00000 0.03609091 -0.04406217
# 54 1994.000 5.46875 -2.13523626 -1.09910122
# 80 2000.000 8.00000 -1.35752606 -0.04406217
# 84 1995.515 9.00000 0.02038586 -1.60541710
# 92 1992.000 5.46875 0.59155773 -1.76857084
Your question makes this unclear, but if you have some default value that you always want to use to replace a missing value (e.g., if 1994 is your baseline), then I would recommend that you first generate those defaults:
defaultValues <-
df %>%
filter(year == 1994) %>%
select(groups
, default_var1 = var1
, default_var2 = var2)
Then, use left_join to merge on the groups. That way, each row will now also have a default. You can then use coalesce to pick the first non-NA value -- which will be the default if and only if the value is missing. End by dropping the default columns.
df %>%
left_join(defaultValues, by = "groups") %>%
mutate(var1 = coalesce(var1, default_var1)
, var2 = coalesce(var2, default_var2)) %>%
select(-starts_with("default"))
If your defaults are more complex, you would just need to construct them to match your desired behavior. For example, if you want it to fill in the value from two years prior, use:
complex_defaultValues <-
df %>%
mutate(year = year + 2) %>%
rename(default_var1 = var1
, default_var2 = var2)
Then join on both year and groups, and the rows will align correctly. Note that if the value from two years earlier is itself missing, the result will still be missing after coalesce, so you may need to account for missing values in your defaults as well.
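As a rough sketch of that two-key join, assuming the example data from the question (rebuilt here so the snippet runs on its own): the name `filled` is only illustrative, and `2L` is used instead of `2` so that `year` stays an integer on both sides of the join.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(8675309)
df <- data.frame(
  year   = rep(1991:2000, 10),
  groups = rep(1:10, each = 10),
  var1   = rnorm(100),
  var2   = rnorm(100)
)

# Blank out the positive 1996 values of var1, then the matching var2 values
df$var1[df$year == 1996 & df$var1 > 0] <- NA
df$var2[is.na(df$var1)] <- NA

# Shift every row forward two years so it lines up with the row it may fill
complex_defaultValues <- df %>%
  mutate(year = year + 2L) %>%
  rename(default_var1 = var1, default_var2 = var2)

# Join on both keys, keep the observed value when present, else the default
filled <- df %>%
  left_join(complex_defaultValues, by = c("year", "groups")) %>%
  mutate(var1 = coalesce(var1, default_var1),
         var2 = coalesce(var2, default_var2)) %>%
  select(-starts_with("default"))
```

Because only 1996 has gaps in this example and 1994 is complete, every missing value gets filled. With gaps in other years, some defaults could themselves be NA.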
Finally, if you just want to propagate the last non-NA value forward (instead of trying to go back two years, or always using the same default), you can use fill from tidyr:
df %>%
group_by(groups) %>%
fill(var1, var2)
This fills downward by default, so make sure your data are sorted the way you want first.
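To illustrate why the sort order matters, here is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not from the question) whose rows are deliberately out of order. Sorting by group and year before filling makes "the previous row" mean "the previous year".

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, deliberately not sorted by year within each group
toy <- data.frame(
  groups = c(1, 1, 1, 2, 2, 2),
  year   = c(1996, 1994, 1995, 1995, 1996, 1994),
  var1   = c(NA, 10, 11, 21, NA, 20)
)

filled <- toy %>%
  arrange(groups, year) %>%          # put each group in chronological order
  group_by(groups) %>%               # never carry values across groups
  fill(var1, .direction = "down") %>% # last observed value carried forward
  ungroup()
```

Here each group's 1996 row picks up its 1995 value (11 and 21). Without the `arrange()` step, group 1's 1996 row comes first and has nothing above it to fill from, so it would stay NA.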