Alex

Reputation: 2077

Conditionally replace values in rows using dplyr

I have a data.frame with variables that are indexed by group and year like so:

library(tidyverse)

set.seed(8675309)

df <- data.frame(
  year = rep(1991:2000, 10), 
  groups = rep(1:10, each = 10), 
  var1 = rnorm(100), 
  var2 = rnorm(100)
)

head(df)

  year groups       var1        var2
1 1991      1 -0.9965824  0.74453768
2 1992      1  0.7218241 -1.34662801
3 1993      1 -0.6172088  0.33014251
4 1994      1  2.0293916 -0.01272533
5 1995      1  1.0654161 -0.46367596
6 1996      1  0.9872197  0.20494209

where some of the observations are missing for a specific year, say, 1996:

df[df$year == 1996, ]$var1 <- ifelse(df[df$year == 1996, ]$var1 > 0,
                                    NA, df[df$year == 1996, ]$var1)
## If 1996 is missing in var1, it is missing in all vars:
df$var2 <- ifelse(is.na(df$var1), NA, df$var2)

My question is, how can I replace the values of var1 and var2 conditional on whether or not they already exist? This is the gist of what I want:

df %>%
  group_by(groups) %>%
  mutate_all(funs(replace_1996_if_NA_with_value_from_1994))

Upvotes: 0

Views: 3985

Answers (2)

Megatron

Reputation: 17089

Since it's unclear how you'd like to replace missing values, I replace them using mean imputation (taking the mean of the column and using that to replace the value).

# Some of the observations are now missing
n <- 10
df[cbind(sample(1:nrow(df), n, replace = TRUE), sample(1:ncol(df), n, replace = TRUE))] <- NA

We extract the rows containing NAs:

df[rowSums(is.na(df)) > 0,]
#    year groups        var1       var2
# 5  1995      1          NA -0.4636760
# 14 1994      2          NA  1.1556394
# 34 1994     NA  0.58852729 -0.7053416
# 37 1997      4  0.06391704         NA
# 47 1997     NA -0.87493144  1.1691501
# 50 2000      5  0.03609091         NA
# 54 1994     NA -2.13523626 -1.0991012
# 80 2000      8 -1.35752606         NA
# 84   NA      9  0.02038586 -1.6054171
# 92 1992     NA  0.59155773 -1.7685708

Replace with column means using dplyr's mutate() and across() (mutate_each() and funs() are deprecated in current dplyr):

newDF <- mutate(df, across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

Updated columns:

newDF[rowSums(is.na(df)) > 0,]

#        year  groups        var1        var2
# 5  1995.000 1.00000  0.04923291 -0.46367596
# 14 1994.000 2.00000  0.04923291  1.15563940
# 34 1994.000 5.46875  0.58852729 -0.70534164
# 37 1997.000 4.00000  0.06391704 -0.04406217
# 47 1997.000 5.46875 -0.87493144  1.16915008
# 50 2000.000 5.00000  0.03609091 -0.04406217
# 54 1994.000 5.46875 -2.13523626 -1.09910122
# 80 2000.000 8.00000 -1.35752606 -0.04406217
# 84 1995.515 9.00000  0.02038586 -1.60541710
# 92 1992.000 5.46875  0.59155773 -1.76857084
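
Note that the output above also imputes year and groups, which is probably not what you want. A possible variation, sketched on a small made-up data frame (toy and its values are hypothetical, not the data from the question), limits imputation to var1 and var2 and uses the mean within each group. Whether a per-group mean is appropriate is an assumption about your data.

```r
library(dplyr)

# Hypothetical toy data: 2 groups, 3 years, a few missing measurements
toy <- data.frame(
  year   = rep(1991:1993, 2),
  groups = rep(1:2, each = 3),
  var1   = c(1, NA, 3, 10, 20, NA),
  var2   = c(NA, 2, 4, 5, NA, 7)
)

# Impute only the measurement columns, using the mean within each group
imputed <- toy %>%
  group_by(groups) %>%
  mutate(across(c(var1, var2),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()

imputed
```

The year and groups columns are left untouched because they are not named in across().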

Upvotes: 0

Mark Peterson

Reputation: 9560

Your question makes this unclear, but if you have some default value that you always want to use to replace a missing value (e.g., if 1994 is your baseline), then I would recommend that you first generate those defaults:

defaultValues <-
  df %>%
  filter(year == 1994) %>%
  select(groups
         , default_var1 = var1
         , default_var2 = var2)

Then, use left_join to merge on the groups. That way, each row will now also have a default. You can then use coalesce to pick the first non-NA value -- which will be the default if and only if the value is missing. End by cleaning away the default values.

df %>%
  left_join(defaultValues, by = "groups") %>%
  mutate(var1 = coalesce(var1, default_var1)
         , var2 = coalesce(var2, default_var2)) %>%
  select(-starts_with("default"))

If your defaults are more complex, you would just need to construct them to match your desired behavior. For example, if you want it to fill in the value from two years prior, use:

complex_defaultValues <-
  df %>%
  mutate(year = year + 2) %>%
  rename(default_var1 = var1
         , default_var2 = var2)

Then join on both year and groups, and the rows will align correctly. Note that if the value from two years earlier is itself missing, the result will still be NA after coalesce, so you may need to handle missing values in your defaults as well.
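
As a rough sketch of that two-key join, here is the same idea on a small made-up data frame (toy, lagged_defaults, and the values are hypothetical and chosen only to make the result easy to check):

```r
library(dplyr)

# Hypothetical toy data: 2 groups, 4 years, one missing row per group
toy <- data.frame(
  year   = rep(1991:1994, 2),
  groups = rep(1:2, each = 4),
  var1   = c(1, 2, NA, 4, 5, 6, 7, NA),
  var2   = c(10, 20, NA, 40, 50, 60, 70, NA)
)

# Shift every row forward two years so it lines up with the row it may fill
# (2L keeps year an integer, matching the type of toy$year)
lagged_defaults <- toy %>%
  mutate(year = year + 2L) %>%
  rename(default_var1 = var1, default_var2 = var2)

# Join on both keys, keep the observed value where present, then drop helpers
filled <- toy %>%
  left_join(lagged_defaults, by = c("year", "groups")) %>%
  mutate(var1 = coalesce(var1, default_var1),
         var2 = coalesce(var2, default_var2)) %>%
  select(-starts_with("default"))

filled
```

Here the 1993 row of group 1 picks up the 1991 values, and the 1994 row of group 2 picks up the 1992 values. Rows in 1991 and 1992 have no earlier match, so any NA there would stay NA.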

Finally, if you just want to propagate the last non-NA value forward (instead of trying to go back two years, or always using the same default), you can use fill from tidyr:

df %>%
  group_by(groups) %>%
  fill(var1, var2)

This fills downward by default, so make sure your data are sorted the way you want.
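
To illustrate why the sort order matters, here is a minimal sketch on a small made-up, unsorted data frame (toy and its values are hypothetical). It sorts by group and year before filling, and spells out the fill direction rather than relying on the default:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, deliberately out of year order
toy <- data.frame(
  year   = c(1993, 1991, 1992, 1992, 1991, 1993),
  groups = c(1, 1, 1, 2, 2, 2),
  var1   = c(NA, 1, 2, NA, 5, NA)
)

# Sort first so "down" means "forward in time" within each group
filled <- toy %>%
  arrange(groups, year) %>%
  group_by(groups) %>%
  fill(var1, .direction = "down") %>%
  ungroup()

filled
```

A group whose first year is missing would keep that NA, since there is nothing above it to carry down.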

Upvotes: 1
