Eric Green

Reputation: 7725

Randomly add NA values to a dataframe with the proportion set by group

I would like to randomly add NA values to my dataframe with the proportion set by group.

library(tidyverse)
set.seed(1)
dat <- tibble(group = c(rep("A", 100),
                        rep("B", 100)),
              value = rnorm(200))

pA <- 0.5
pB <- 0.2

# Does not work.
# I was trying to create another column that I could use with
# case_when to set value to NA if missing == 1.
dat %>%
  group_by(group) %>%
  mutate(missing = rbinom(n(), 1, c(pA, pB))) %>%
  summarise(mean = mean(missing))

Upvotes: 1

Views: 269

Answers (2)

dipetkov

Reputation: 3700

I'd create a small tibble to keep track of the expected missingness rates and join it to the first data frame. Then, for each row, draw a uniform random number and set the value to missing when the draw falls below that row's group rate.

This is easy to generalize to more than two groups as well.

library("tidyverse")

set.seed(1)

dat <- tibble(
  group = c(
    rep("A", 100),
    rep("B", 100)
  ),
  value = rnorm(200)
)

expected_nans <- tibble(
  group = c("A", "B"),
  p = c(0.5, 0.2)
)

dat_with_nans <- dat %>%
  inner_join(
    expected_nans,
    by = "group"
  ) %>%
  mutate(
    r = runif(n()),
    value = if_else(r < p, NA_real_, value)
  ) %>%
  select(
    -p, -r
  )

dat_with_nans %>%
  group_by(
    group
  ) %>%
  summarise(
    mean(is.na(value))
  )
#> # A tibble: 2 × 2
#>   group `mean(is.na(value))`
#>   <chr>                <dbl>
#> 1 A                     0.53
#> 2 B                     0.17

Created on 2022-03-11 by the reprex package (v2.0.1)
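
To illustrate the point about generalizing to more than two groups, here is a minimal sketch of the same join approach with three groups. The third group "C", its rate of 0.05, and the names `dat3`, `rates`, and `dat3_with_nas` are assumptions made up for this example; they are not part of the original question.

```r
library(dplyr)

set.seed(1)

# Hypothetical three-group data; group "C" is invented for illustration.
dat3 <- tibble(
  group = rep(c("A", "B", "C"), each = 100),
  value = rnorm(300)
)

# One row per group with the desired missingness rate (assumed values).
rates <- tibble(
  group = c("A", "B", "C"),
  p = c(0.5, 0.2, 0.05)
)

dat3_with_nas <- dat3 %>%
  inner_join(rates, by = "group") %>%
  # Each row gets its own uniform draw, compared against its group's rate.
  mutate(value = if_else(runif(n()) < p, NA_real_, value)) %>%
  select(-p)

# Observed share of NA per group; expect it to be close to p, not equal to it.
dat3_with_nas %>%
  group_by(group) %>%
  summarise(prop_na = mean(is.na(value)))
```

Adding a group only requires another row in the rates table. Note that `inner_join()` drops rows whose group has no entry in that table, so every group should be listed.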

Upvotes: 1

Eric Green

Reputation: 7725

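Nesting and unnesting works: nest by group, attach each group's probability, draw the missingness indicator inside each nested data frame, then unnest. The first pipeline checks the realized proportions; the second sets the flagged values to NA.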

library(tidyverse)
dat <- tibble(group = c(rep("A", 1000),
                        rep("B", 1000)),
              value = rnorm(2000))

pA <- .1
pB <- 0.5

set.seed(1)
dat %>%
  group_by(group) %>%
  nest() %>%
  mutate(p = case_when(
    group=="A" ~ pA,
    TRUE ~ pB
  )) %>%
  mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>% 
  unnest(cols = data) %>%
  summarise(mean = mean(missing))

# A tibble: 2 × 2
  group  mean
  <chr> <dbl>
1 A     0.11 
2 B     0.481

set.seed(1)
dat %>%
  group_by(group) %>%
  nest() %>%
  mutate(p = case_when(
    group=="A" ~ pA,
    TRUE ~ pB
  )) %>%
  mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>% 
  unnest(cols = data) %>%
  ungroup() %>%
  mutate(value = case_when(
    missing == 1 ~ NA_real_,
    TRUE ~ value
  )) %>%
  select(-p, -missing)
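
For comparison, here is a minimal sketch that skips the nesting step by looking up each row's probability in a named vector. The attempt in the question most likely failed because `rbinom()` recycles `c(pA, pB)` across the rows within each group, so both groups end up near the average of the two rates. The names `p` and `dat_na` are assumptions for this example only.

```r
library(dplyr)

set.seed(1)
dat <- tibble(
  group = rep(c("A", "B"), each = 1000),
  value = rnorm(2000)
)

# Named vector of rates; the names must match the values in `group`.
p <- c(A = 0.1, B = 0.5)

dat_na <- dat %>%
  mutate(
    # p[group] expands to one probability per row, so each row's
    # indicator is drawn with its own group's rate.
    missing = rbinom(n(), 1, unname(p[group])),
    value = if_else(missing == 1, NA_real_, value)
  )

# Realized proportion of missing values per group.
dat_na %>%
  group_by(group) %>%
  summarise(mean = mean(missing))
```

Because the probabilities are indexed by row rather than by group, no `group_by()` is needed for the draw itself; it is only used for the summary check.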

Upvotes: 0
