Conditionally replace values in rows using dplyr

Question

I have a data.frame with variables that are indexed by group and year like so:

library(tidyverse)

set.seed(8675309)

df <- data.frame(
  year = rep(1991:2000, 10), 
  groups = rep(1:10, each = 10), 
  var1 = rnorm(100), 
  var2 = rnorm(100)
)

head(df)

  year groups       var1        var2
1 1991      1 -0.9965824  0.74453768
2 1992      1  0.7218241 -1.34662801
3 1993      1 -0.6172088  0.33014251
4 1994      1  2.0293916 -0.01272533
5 1995      1  1.0654161 -0.46367596
6 1996      1  0.9872197  0.20494209

where some of the observations are missing for a specific year, say, 1996:

df[df$year == 1996, ]$var1 <- ifelse(df[df$year == 1996, ]$var1 > 0,
                                    NA, df[df$year == 1996, ]$var1)
## If 1996 is missing in var1, it is missing in all vars:
df$var2 <- ifelse(is.na(df$var1), NA, df$var2)

My question is, how can I replace the values of var1 and var2 conditional on whether or not they already exist? This is the gist of what I want:

df %>%
  group_by(groups) %>%
  mutate_all(funs(replace_1996_if_NA_with_value_from_1994))

Mark Peterson · Accepted Answer

Your question makes this unclear, but if you have some default value that you always want to use to replace a missing value (e.g., if 1994 is your baseline), then I would recommend that you first generate those defaults:

defaultValues <-
  df %>%
  filter(year == 1994) %>%
  select(groups
         , default_var1 = var1
         , default_var2 = var2)

Then, use left_join to merge on the groups. That way, each row will now also have a default. You can then use coalesce to pick the first non-NA value -- which will be the default if and only if the value is missing. End by cleaning away the default values.

df %>%
  left_join(defaultValues, by = "groups") %>%
  mutate(var1 = coalesce(var1, default_var1)
         , var2 = coalesce(var2, default_var2)) %>%
  select(-starts_with("default"))
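
To see what coalesce does on its own, here is a small sketch with made-up vectors (the names observed and defaults are purely illustrative, not from the question):

```r
library(dplyr)

# Hypothetical vectors: `observed` has a gap, `defaults` supplies fallbacks
observed <- c(1, NA, 3)
defaults <- c(10, 20, 30)

# For each position, coalesce keeps the first value that is not NA,
# so a default is only used where `observed` is missing
result <- coalesce(observed, defaults)
result
```

Only the second element is taken from defaults, which is why the joined default columns never overwrite values that already exist.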

If your defaults are more complex, you would just need to construct them to match your desired behavior. For example, if you want it to fill in the value from two years prior, use:

complex_defaultValues <-
  df %>%
  mutate(year = year + 2) %>%
  rename(default_var1 = var1
         , default_var2 = var2)

Then join on both year and groups, and the defaults will align correctly. Note, though, that if the value from two years earlier is itself missing, the result will still be missing after coalesce, so you may need to handle missing values in your defaults as well.
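
A minimal sketch of that join, assuming the df from the question (rebuilt here so the block runs on its own). The result name filled is only illustrative, and 2L is used instead of 2 simply to keep year an integer on both sides of the join:

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(8675309)
df <- data.frame(
  year = rep(1991:2000, 10),
  groups = rep(1:10, each = 10),
  var1 = rnorm(100),
  var2 = rnorm(100)
)
# Blank out positive 1996 values of var1, then var2 wherever var1 is missing
df$var1[df$year == 1996 & df$var1 > 0] <- NA
df$var2[is.na(df$var1)] <- NA

# Shift every row forward two years so it lines up with the row it may fill
complex_defaultValues <- df %>%
  mutate(year = year + 2L) %>%
  rename(default_var1 = var1, default_var2 = var2)

filled <- df %>%
  left_join(complex_defaultValues, by = c("year", "groups")) %>%
  mutate(
    var1 = coalesce(var1, default_var1),
    var2 = coalesce(var2, default_var2)
  ) %>%
  select(-starts_with("default"))
```

In this example only 1996 has gaps and 1994 is complete, so every gap gets filled; with real data, rows whose two-years-earlier value is also missing would stay NA.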

Finally, if you just want to propagate the last non-NA value forward (instead of trying to go back two years, or always using the same default), you can use fill from tidyr:

df %>%
  group_by(groups) %>%
  fill(var1, var2)

This fills downward automatically, so make sure your data are sorted the way you want first.
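
As a sketch of why the sort order matters, the toy data below (hypothetical, not from the question) is deliberately out of year order, so it is arranged by group and year before filling:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, deliberately not sorted by year
toy <- data.frame(
  groups = c(1, 1, 1, 2, 2, 2),
  year   = c(1996, 1994, 1995, 1995, 1996, 1994),
  var1   = c(NA, 0.5, 0.7, -1.2, NA, 0.3)
)

filled_toy <- toy %>%
  arrange(groups, year) %>%  # sort so that "down" means "forward in time"
  group_by(groups) %>%       # never carry a value across groups
  fill(var1) %>%             # default direction is "down"
  ungroup()
```

Without the arrange step, the 1996 row in group 1 comes first within its group and would stay NA, because there is nothing above it to carry down.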
