Reputation: 1064
I have a data frame with missing (NA) values in column X1 and a grouping variable group. I want to replace each NA with a value sampled from the non-NA values of the same group. This should be done for all groups except one (group == "C"). For this conditional replacement with resampled data I tried if/else and case_when within dplyr's mutate, but without success. I guess this is because both the TRUE and FALSE branches are evaluated before the condition is assessed. (The case_when condition itself works and selects the appropriate cases, as shown by the calculation of X2; it is the sample call that causes the problem.)
library(dplyr)

# Original data frame
df <-
  data.frame(
    id = 1:10,
    group = c(rep("A", 5), rep("B", 4), "C"),
    X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA)) %>%
  group_by(group) %>%
  mutate(X2 = case_when(is.na(X1) & group != "C" ~ 3,
                        TRUE ~ 2))
# Approach with if/else (doesn't work)
df %>%
  mutate(X3 = if (is.na(X1) & group != "C") sample(X1[!is.na(X1)], size = n(), replace = TRUE) else X1)
# Approach with case_when (doesn't work either)
df %>%
  mutate(X3 = case_when(is.na(X1) & group != "C" ~
                          sample(X1[!is.na(X1)], size = n(), replace = TRUE),
                        TRUE ~ X1))
Upvotes: 0
Views: 49
Reputation: 389215
You are right that case_when and if_else evaluate all of their right-hand-side expressions eagerly, regardless of which condition is TRUE.
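As a rough illustration of that eager evaluation, the sketch below compares if_else with a plain if/else. The tracked() helper and the counter environment are hypothetical names made up for this example; they only count how often a branch expression is actually evaluated.

```r
library(dplyr)

# Hypothetical helper: counts how many times it has been called
counter <- new.env()
counter$n <- 0
tracked <- function(value) {
  counter$n <- counter$n + 1
  value
}

cond <- c(TRUE, TRUE, TRUE)

# if_else evaluates both branches, even though the condition is never FALSE
res_vectorised <- if_else(cond, tracked(1), tracked(2))
n_if_else <- counter$n

# A plain if/else evaluates only the branch that is selected
counter$n <- 0
res_scalar <- if (all(cond)) tracked(1) else tracked(2)
n_if <- counter$n

c(if_else = n_if_else, plain_if = n_if)
```

With a plain if/else the unused branch is never run, so a sample() call that would fail for group "C" is simply skipped.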
To avoid case_when or if_else, you can write a function that replaces the NA values in a group with non-NA values sampled from the same group. This function can then be called for each group except "C", which can be achieved using a plain if/else.
library(dplyr)

# Replace the NA values of x with values sampled from its non-NA values
sample_NA_func <- function(x) {
  inds <- is.na(x)
  x[inds] <- sample(x[!inds], size = sum(inds), replace = TRUE)
  x
}

set.seed(2024)
df %>%
  mutate(X3 = if (all(group != "C")) sample_NA_func(X1) else X1, .by = group)
#   id group X1 X3
#1   1     A NA  1
#2   2     A  2  2
#3   3     A  1  1
#4   4     A NA  2
#5   5     A  4  4
#6   6     B  3  3
#7   7     B NA  3
#8   8     B  8  8
#9   9     B  9  9
#10 10     C NA NA
data
df <- data.frame(
  id = 1:10,
  group = c(rep("A", 5), rep("B", 4), "C"),
  X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA))
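One caveat with sample_NA_func: when its first argument is a single number of at least 1, base R's sample() draws from 1:x rather than from that value, and it errors when there is nothing to draw from. So a group with exactly one non-NA value, or with none at all, can give surprising results. A possible variant that samples positions instead of values is sketched below; the name sample_NA_safe and the choice to return all-NA groups unchanged are my own assumptions, not part of the question.

```r
# Hypothetical variant of sample_NA_func that samples positions, not values
sample_NA_safe <- function(x) {
  inds <- is.na(x)
  pool <- x[!inds]
  # Assumption: if there is nothing to fill, or nothing to draw from,
  # return x unchanged instead of raising an error
  if (!any(inds) || length(pool) == 0) return(x)
  # sample.int() always draws indices from 1..length(pool)
  x[inds] <- pool[sample.int(length(pool), size = sum(inds), replace = TRUE)]
  x
}

one_value <- sample_NA_safe(c(NA, 8, NA))  # only 8 can be drawn
all_missing <- sample_NA_safe(c(NA_real_, NA_real_))
```

It can be dropped into the mutate call above in place of sample_NA_func.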
Upvotes: 1
Reputation: 12586
Here's a tidyverse solution.
library(tidyverse)

df %>%
  group_by(group) %>%
  group_modify(
    function(.x, .y) {
      if (.y$group == "C") {
        .x
      } else {
        nonMissing <- .x %>% filter(!is.na(X1)) %>% pull(X1)
        .x %>%
          mutate(
            Temp = sample(nonMissing, nrow(.x), replace = TRUE),
            X2 = ifelse(is.na(X1), Temp, X1)
          ) %>%
          select(-Temp)
      }
    }
  )
# A tibble: 10 × 4
# Groups:   group [3]
   group    id    X1    X2
   <chr> <int> <dbl> <dbl>
 1 A         1    NA     2
 2 A         2     2     2
 3 A         3     1     1
 4 A         4    NA     1
 5 A         5     4     4
 6 B         6     3     3
 7 B         7    NA     3
 8 B         8     8     8
 9 B         9     9     9
10 C        10    NA    NA
group_modify applies its argument (a function) to each group of a grouped data frame and returns the modified data frame. The function takes two arguments, conventionally .x and .y. .x contains the data for the current group, without the grouping columns. .y is a one-row tibble containing the grouping columns, whose values identify the current group.
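To make the roles of .x and .y a bit more concrete, here is a small sketch on a made-up data frame (toy, key, n_rows and cols are names invented for this illustration). For each group it reports the key taken from .y, together with the number of rows and the column names of .x.

```r
library(dplyr)

# Made-up example data, not the data from the question
toy <- data.frame(group = c("A", "A", "B"), value = 1:3)

inspected <- toy %>%
  group_by(group) %>%
  group_modify(function(.x, .y) {
    tibble(
      key    = .y$group,                       # value identifying the group
      n_rows = nrow(.x),                       # rows in the current group
      cols   = paste(names(.x), collapse = ",") # columns visible in .x
    )
  }) %>%
  ungroup()

inspected
```

Note that cols lists only value: the grouping column is not part of .x and is added back by group_modify.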
The function simply returns the current group unchanged when group equals "C". Otherwise, the non-missing values are extracted from X1 and a random vector, whose length equals the number of rows in the current group, is sampled from them. X2 is then constructed from X1 or Temp as appropriate, and Temp is deleted.
The latter process could be written in a more compact fashion, but I think this longer version is easier to understand.
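For reference, one possible compact form is sketched below. It is not identical to the code above: it sets X2 in group "C" explicitly instead of leaving the column out, it uses coalesce instead of ifelse, and it samples positions with sample.int. The pool variable and the extra check that skips groups without any non-missing value are assumptions added for this sketch.

```r
library(dplyr)

df <- data.frame(
  id = 1:10,
  group = c(rep("A", 5), rep("B", 4), "C"),
  X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA))

set.seed(2024)
compact <- df %>%
  group_by(group) %>%
  group_modify(function(.x, .y) {
    pool <- .x$X1[!is.na(.x$X1)]
    if (.y$group == "C" || length(pool) == 0) {
      # Leave group "C" (and, by assumption, all-NA groups) untouched
      .x$X2 <- .x$X1
    } else {
      # One candidate per row; coalesce keeps X1 wherever it is not NA
      fill <- pool[sample.int(length(pool), nrow(.x), replace = TRUE)]
      .x$X2 <- coalesce(.x$X1, fill)
    }
    .x
  }) %>%
  ungroup()

compact
```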
Test data
df <-
  data.frame(
    id = 1:10,
    group = c(rep("A", 5), rep("B", 4), "C"),
    X1 = c(NA, 2, 1, NA, 4, 3, NA, 8, 9, NA))
Upvotes: 1