Reputation: 175

Introducing missing values using number of IDs in R randomly

I have a dataset with many facilities, each with a unique Facility ID, and variables clustered by Facility ID. I would like to randomly select a number of IDs and then introduce missing values for a given number of reported values within each selected facility.

Below is a sample of the dataset.

h <- data.frame(cbind(FacilityID = rep(1:5,each=12),X1=rnorm(60,0,1)))

The data has 5 FacilityIDs with 12 values reported for each ID for a variable X1.

I would like to perform the following:

For 2 IDs selected randomly, 3 missing values are assigned randomly within each of those IDs
For 1 ID selected randomly, 4 missing values are assigned randomly within that ID

Upvotes: 3

Answers (3)

hello_friend

Reputation: 5788

Base R:

# Set seed for reproducibility:
set.seed(2020)

# Store no_nas, the number of nas to introduce per facility: no_nas => integer vector
no_nas <- c(rep(3, 2), 4)

# Store n, the number of facilities to sample: n => integer scalar
n <- length(no_nas)

# For each randomly sampled FacilityID, subset its rows and set a random
# selection of X1 values to NA: facidsample => data.frame
facidsample <- do.call(rbind, Map(function(x, y) {
  i <- h[h$FacilityID == x, ]
  i$X1[sample(seq_len(nrow(i)), y)] <- NA_real_
  i
}, sample(unique(h$FacilityID), n), no_nas))
    
# Combine the rows of the untouched facilities with the modified rows: j => data.frame
j <- rbind(h[h$FacilityID %in% setdiff(h$FacilityID, facidsample$FacilityID),],
           facidsample)
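
Note that the rbind() step moves the modified facilities to the end of j. If the original row order matters, the same idea can be sketched with in-place assignment. This is only a sketch under assumptions: the helper name add_nas_by_id and its arguments are hypothetical and not part of the code above.

```r
# Hypothetical helper (an assumption, not from the answer above):
# picks length(no_nas) distinct IDs and sets no_nas[k] random values to NA
# within the k-th picked ID, leaving the row order untouched.
add_nas_by_id <- function(data, id_col, value_col, no_nas) {
  ids <- unique(data[[id_col]])
  stopifnot(length(no_nas) <= length(ids))
  # Index with sample.int() so a single remaining ID is handled safely
  chosen <- ids[sample.int(length(ids), length(no_nas))]
  for (k in seq_along(chosen)) {
    rows <- which(data[[id_col]] == chosen[k])
    stopifnot(no_nas[k] <= length(rows))
    data[[value_col]][rows[sample.int(length(rows), no_nas[k])]] <- NA
  }
  data
}

set.seed(2020)
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60, 0, 1))
j <- add_nas_by_id(h, "FacilityID", "X1", no_nas = c(3, 3, 4))

# Row order is unchanged, and 3 + 3 + 4 = 10 values are now missing
identical(j$FacilityID, h$FacilityID)
sum(is.na(j$X1))
```

Because nothing is split and recombined, no reordering step is needed afterwards.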

Upvotes: 2

Paul

Reputation: 9107

Here is a tidyverse solution.

Use sample() to draw the 3 IDs. Within each group, sample(row_number()) <= 4 is TRUE for exactly 4 randomly chosen rows.

library(tidyverse)

ids <- sample(unique(h$FacilityID), 3)

h %>%
  group_by(FacilityID) %>%
  mutate(
    X1 = case_when(
      FacilityID %in% ids[1:2] & sample(row_number()) <= 3 ~ NA_real_,
      FacilityID %in% ids[3] & sample(row_number()) <= 4 ~ NA_real_,
      TRUE ~ X1
    )
  )
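
To see why the comparison marks exactly k rows: within a group of n rows, row_number() is 1..n, sample() shuffles those numbers, and exactly k of the shuffled values are <= k. A small base R illustration of that idea, using seq_len() in place of row_number() (the variable names n, k and pick are just for this sketch):

```r
set.seed(1)

n <- 12  # rows in one facility
k <- 4   # number of values to set to NA

# Shuffle 1..n, then flag the positions holding the values 1..k
pick <- sample(seq_len(n)) <= k

length(pick)  # one flag per row
sum(pick)     # exactly k flags are TRUE, whatever the seed
```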

Upvotes: 4

Allan Cameron

Reputation: 174238

It's not clear whether you want these two operations to be performed together or individually.

Individually you could do:

# Set 3 values from 2 IDs to NA
for(i in sample(unique(h$FacilityID), 2)) {
  h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}

# Set 4 values from 1 ID to NA:
h$X1[sample(which(h$FacilityID == sample(unique(h$FacilityID), 1)), 4)] <- NA

If you want to perform both operations at once on the same data set, using three distinct IDs, you can do:

IDs <- sample(unique(h$FacilityID), 3)

for(i in IDs) {
  if(i == IDs[3])
    h$X1[sample(which(h$FacilityID == i), 4)] <- NA
  else
    h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
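
One way to check the combined version is to count the NAs per facility afterwards. The sketch below rebuilds the sample data so it runs on its own; the seed value and the names n_missing and na_counts are arbitrary choices for illustration.

```r
set.seed(123)
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60, 0, 1))

IDs <- sample(unique(h$FacilityID), 3)

for (i in IDs) {
  # The third sampled ID gets 4 NAs, the other two get 3 each
  n_missing <- if (i == IDs[3]) 4 else 3
  h$X1[sample(which(h$FacilityID == i), n_missing)] <- NA
}

# Count the NAs in each facility
na_counts <- tapply(is.na(h$X1), h$FacilityID, sum)
print(na_counts)
```

Two facilities should show 3 missing values, one should show 4, and the remaining two should show 0.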

Upvotes: 4
