Groupwise replace NA's with sampled value from non NA's using dplyr

I have a data frame with missing values (NA) in column X1 and a grouping variable group. I want to replace each NA with a value sampled from the non-NA values of the same group. This should be done for all groups except one (group == "C"). For this conditional replacement I tried if/else and case_when inside dplyr's mutate, without success. I suspect this is because both the TRUE and the FALSE branches are evaluated before the condition is applied. (The case_when condition itself works and selects the right rows, as the calculation of X2 shows; the problem appears once sample is used.)

library(dplyr)

# Original data frame
df <-
  data.frame(
    id = 1:10,
    group = c(rep("A", 5), rep("B", 4), "C"),
    X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA)) %>%
  group_by(group) %>%
  mutate(X2 = case_when(is.na(X1) & group != "C" ~ 3,
                        TRUE ~ 2))

# Approach with if else (doesn't work)
df %>%
  mutate(X3 = if(is.na(X1) & group != "C") sample(X1[!is.na(X1)], size = n(), replace = TRUE) else X1)

# Approach with case_when (doesn't work either)
df %>%
  mutate(X3 = case_when(is.na(X1) & group != "C" ~
                          ~sample(X1[!is.na(X1)], size = n(), replace = TRUE),
                        TRUE ~ X1))

Upvotes: 0

Answers (2)

Ronak Shah

Reputation: 389215

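You are right: case_when and if_else evaluate every right-hand-side expression in full, regardless of which condition turns out to be TRUE, and only then pick values row by row. The plain if attempt fails for a different reason: if expects a single TRUE or FALSE, while is.na(X1) & group != "C" returns one value per row.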
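
To see why full evaluation matters here: group "C" has no non-NA values, so the sampling expression has nothing to draw from in that group. A minimal base-R sketch, assuming the sampling expression is evaluated for group "C" even though none of its rows need a replacement (x_c, pool and res are illustrative names, not part of the question):

```r
# Group "C" from the question: its only X1 value is NA, so the pool is empty.
x_c  <- c(NA_real_)
pool <- x_c[!is.na(x_c)]   # numeric(0)

# Sampling from an empty pool raises an error, no matter which condition
# would later have discarded the result.
res <- tryCatch(
  sample(pool, size = length(x_c), replace = TRUE),
  error = function(e) conditionMessage(e)
)

# res holds the error message as a string instead of a sampled value.
print(res)
```

This is why the exclusion of group "C" has to happen before the sampling expression is evaluated, not after.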

To avoid case_when and if_else, you can write a function that replaces the NA values in a group with values sampled from the non-NA values of the same group. The function is then called for every group except "C", which a plain if/else on the group can handle.

library(dplyr)

sample_NA_func <- function(x) {
  inds <- is.na(x)
  x[inds] <- sample(x[!inds], size = sum(inds), replace = TRUE)
  x
}

set.seed(2024)
df%>%
  mutate(X3 = if(all(group!="C")) sample_NA_func(X1) else X1, .by = group)

#   id group X1 X3
#1   1     A NA  1
#2   2     A  2  2
#3   3     A  1  1
#4   4     A NA  2
#5   5     A  4  4
#6   6     B  3  3
#7   7     B NA  3
#8   8     B  8  8
#9   9     B  9  9
#10 10     C NA NA
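
One caveat with sample(): when its first argument is a single number n >= 1, it samples from 1:n instead of returning n itself. A group with exactly one non-NA value (say 5) would therefore be filled with values between 1 and 5. The sketch below is a possible defensive variant, not part of the code above; the name sample_NA_safe is hypothetical. It samples positions rather than values and leaves a group untouched when there is nothing to draw from.

```r
# Hypothetical variant of sample_NA_func(); the name is illustrative.
sample_NA_safe <- function(x) {
  inds <- is.na(x)
  pool <- x[!inds]
  # Nothing to fill, or nothing to draw from: return the input unchanged.
  if (!any(inds) || length(pool) == 0L) return(x)
  # Sample positions rather than values, so a single-element pool such as 5
  # is not expanded to 1:5 by sample().
  x[inds] <- pool[sample.int(length(pool), size = sum(inds), replace = TRUE)]
  x
}

filled <- sample_NA_safe(c(NA, 5, NA))            # one non-NA value
all_na <- sample_NA_safe(c(NA_real_, NA_real_))   # empty pool
```

The if/else on group is still needed with this variant, because "C" is excluded by requirement and not only because its pool happens to be empty.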

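Data (ungrouped: .by needs dplyr 1.1.0 or later and cannot be used on a data frame that is already grouped with group_by())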

df <- data.frame(
        id = 1:10,
        group = c(rep("A",5),rep("B",4),"C"),
        X1 = c(NA, 2, 1, NA,4, 3, NA, 8, 9, NA))

Upvotes: 1

Limey

Reputation: 12586

Here's a tidyverse solution.

library(tidyverse)

df %>% 
  group_by(group) %>% 
  group_modify(
    function(.x, .y) {
      if (.y$group == "C") {
        .x
      } else {
        nonMissing <- .x %>% filter(!is.na(X1)) %>% pull(X1)
        .x %>% 
          mutate(
            Temp = sample(nonMissing, nrow(.x), replace = TRUE),
            X2 = ifelse(is.na(X1), Temp, X1)
          ) %>% 
          select(-Temp)
      }
    }
  )
# A tibble: 10 × 4
# Groups:   group [3]
   group    id    X1    X2
   <chr> <int> <dbl> <dbl>
 1 A         1    NA     2
 2 A         2     2     2
 3 A         3     1     1
 4 A         4    NA     1
 5 A         5     4     4
 6 B         6     3     3
 7 B         7    NA     3
 8 B         8     8     8
 9 B         9     9     9
10 C        10    NA    NA

group_modify applies its argument (a function) to each group of a grouped data frame and returns the modified data frame. The function takes two arguments, conventionally .x and .y. .x contains the data for the current group, without the grouping columns. .y is a one-row tibble holding the grouping columns, whose values identify the current group.

The function simply returns the current group unchanged when group equals "C". Otherwise, it extracts the non-missing values of X1 and draws from them, with replacement, a random vector (Temp) with one element per row of the current group. X2 is then taken from Temp where X1 is NA and from X1 everywhere else, and Temp is dropped.

The latter process could be written in a more compact fashion, but I think this longer version is easier to understand.
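
For comparison, a more compact form might look like the sketch below. It assumes the same test data, and the helper name fill_na is hypothetical. It draws only as many values as there are NAs instead of building a full-length Temp column, and it samples positions so that a group with a single non-NA value is handled as expected.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  group = c(rep("A", 5), rep("B", 4), "C"),
  X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA))

# Hypothetical helper: fill the NAs of x with values drawn from its non-NAs.
fill_na <- function(x) {
  miss <- is.na(x)
  pool <- x[!miss]
  replace(x, miss, pool[sample.int(length(pool), sum(miss), replace = TRUE)])
}

set.seed(1)  # arbitrary seed, only to make the draw repeatable
out <- df %>%
  group_by(group) %>%
  group_modify(function(.x, .y) {
    # .y$group is a single value, so a plain if/else is safe here;
    # group "C" keeps X1 as it is, NA included.
    mutate(.x, X2 = if (.y$group == "C") X1 else fill_na(X1))
  })
```

Every group now returns the same columns, so X2 exists for "C" as well and simply carries over its NA.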

Test data

df <- 
  data.frame(
    id = 1:10,
    group = c(rep("A",5),rep("B",4),"C"),
    X1 = c(NA, 2, 1, NA,4, 3, NA, 8, 9, NA))

Upvotes: 1
