drmariod
drmariod

Reputation: 11762

Replace NA with random numbers within `group_by` in dplyr

I have a data frame in long format and I want to replace missing values by random numbers, but I want to do this group wise with different settings...

library(dplyr)

set.seed(1)

imp_df <- 
  data.frame(exp=rep(letters[1:3], each=2),
             rep=1:2,
             mean=1:6,
             sd=seq(0,0.5,0.1))

df <- 
  data.frame(
    exp=rep(letters[1:3], each=20),
    rep=1:2,
    int=rnorm(60,10,5)
  )

df[sample(1:60,25,replace=F), 'int'] <- NA

So my data looks like above, in the imp_df I have the settings for the rnorm function based on the experiment exp and the replicate rep.

My data frame has then some missing values and I want to replace the NA by the random numbers.

How can I do it using dplyr or tidyr?

Edit

After the answer from @starja, I found a quick, but maybe slow solution by using rowwise together with left_join.

df %>%
  left_join(imp_df) %>%
  rowwise() %>%
  mutate(imp.int=if_else(
    is.na(int),
    rnorm(1, mean, sd),
    int
  )) %>%
  print(n=60)

Are there other ways to do this?

Edit 2

Since the rowwise approach is pretty slow and I couldn't get it running within some dplyr code, I used a for loop to go through imp_df with the imputation settings.

This is a pretty quick solution, but not as readable as I was hoping:

df$imp.int <- df$int


for(line in 1:nrow(imp_df)) {
  imp_settings <- as.list(imp_df[line,])
  rows_missing_values <- which(
    df$exp == imp_settings$exp &
      df$rep == imp_settings$rep &
      is.na(df$imp.int) 
  )
  df$imp.int[rows_missing_values] <- 
    stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}

So we first add a column imp.int for the imputed values and run now line by line the different imputation settings by replacing the NAs for each group.

Upvotes: 0

Views: 635

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21938

This could also be done:

library(dplyr)
library(purrr)

df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  mutate(int = ifelse(is.na(int), 
                      map2(mean, sd, ~ rnorm(1, .x, .y)), int))

   exp rep       int mean  sd
1    a   1         1    1 0.0
2    a   2  10.91822    2 0.1
3    a   1  5.821857    1 0.0
4    a   2   17.9764    2 0.1
5    a   1  11.64754    1 0.0
6    a   2  5.897658    2 0.1
7    a   1  12.43715    1 0.0
8    a   2  13.69162    2 0.1
9    a   1  12.87891    1 0.0
10   a   2  1.986482    2 0.1

Upvotes: 3

starja
starja

Reputation: 10365

I guess there are cleverer solutions out there that use vectorisation, but if you don't have super large data, I like to use a purrr::map function for this together with a small custom made function:

library(dplyr)

set.seed(1)

imp_df <- 
  data.frame(exp=rep(letters[1:3], each=2),
             rep=1:2,
             mean=1:6,
             sd=seq(0,0.5,0.1))

df <- 
  data.frame(
    exp=rep(letters[1:3], each=20),
    rep=1:2,
    int=rnorm(60,10,5)
  )

df[sample(1:60,25,replace=F), 'int'] <- NA

replace_fun <- function(x, mean, sd) {
  if (is.na(x)) {
    rnorm(1, mean, sd)
  } else {
    x
  }
}

df %>% 
  left_join(imp_df, by = c("exp", "rep")) %>% 
  mutate(int = purrr::pmap_dbl(list(int, mean, sd), replace_fun)) %>% 
  head()
#>   exp rep       int mean  sd
#> 1   a   1  1.000000    1 0.0
#> 2   a   2 10.918217    2 0.1
#> 3   a   1  5.821857    1 0.0
#> 4   a   2 17.976404    2 0.1
#> 5   a   1 11.647539    1 0.0
#> 6   a   2  5.897658    2 0.1

Created on 2021-05-27 by the reprex package (v0.3.0)

(If you want, you can remove the mean/sd columns with select(-c(mean, sd)).)

Upvotes: 1

Related Questions