Replace NA with random numbers within `group_by` in dplyr

Question

I have a data frame in long format and I want to replace missing values by random numbers, but I want to do this group wise with different settings...

library(dplyr)

set.seed(1)

imp_df <- 
  data.frame(exp=rep(letters[1:3], each=2),
             rep=1:2,
             mean=1:6,
             sd=seq(0,0.5,0.1))

df <- 
  data.frame(
    exp=rep(letters[1:3], each=20),
    rep=1:2,
    int=rnorm(60,10,5)
  )

df[sample(1:60,25,replace=F), 'int'] <- NA

So my data looks like above, in the imp_df I have the settings for the rnorm function based on the experiment exp and the replicate rep.

My data frame has then some missing values and I want to replace the NA by the random numbers.

How can I do it using dplyr or tidyr?

Edit

After the answer from @starja, I found a quick, but maybe slow solution by using rowwise together with left_join.

df %>%
  left_join(imp_df) %>%
  rowwise() %>%
  mutate(imp.int=if_else(
    is.na(int),
    rnorm(1, mean, sd),
    int
  )) %>%
  print(n=60)

Are there other ways to do this?

Edit 2

Since the rowwise approach is pretty slow and I couldn't get it running within some dplyr code, I used a for loop to go through imp_df with the imputation settings.

This is a pretty quick solution, but not as readable as I was hoping:

df$imp.int <- df$int


for(line in 1:nrow(imp_df)) {
  imp_settings <- as.list(imp_df[line,])
  rows_missing_values <- which(
    df$exp == imp_settings$exp &
      df$rep == imp_settings$rep &
      is.na(df$imp.int) 
  )
  df$imp.int[rows_missing_values] <- 
    stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}

So we first add a column imp.int for the imputed values and run now line by line the different imputation settings by replacing the NAs for each group.

Anoushiravan R · Accepted Answer

This could also be done:

library(dplyr)
library(purrr)

df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  mutate(int = ifelse(is.na(int), 
                      map2(mean, sd, ~ rnorm(1, .x, .y)), int))

   exp rep       int mean  sd
1    a   1         1    1 0.0
2    a   2  10.91822    2 0.1
3    a   1  5.821857    1 0.0
4    a   2   17.9764    2 0.1
5    a   1  11.64754    1 0.0
6    a   2  5.897658    2 0.1
7    a   1  12.43715    1 0.0
8    a   2  13.69162    2 0.1
9    a   1  12.87891    1 0.0
10   a   2  1.986482    2 0.1

Replace NA with random numbers within `group_by` in dplyr

Edit

Edit 2

Answers (2)

Related Questions