Reputation: 175
I have a dataset with many facilities, each with a unique Facility ID, and variables clustered at the facility level. I would like to randomly select a number of IDs and then introduce missing values for a given number of the reported values within each selected facility.
Below is a sample of the dataset.
h <- data.frame(cbind(FacilityID = rep(1:5,each=12),X1=rnorm(60,0,1)))
The data has 5 FacilityIDs with 12 values reported for each ID for a variable X1.
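To make the sample reproducible, a seed can be set before generating it. The sketch below does that and checks the structure described above; the seed value 1 is an arbitrary assumption, not part of the original post.

```r
# Arbitrary seed (an assumption, not from the original post) so X1 is reproducible
set.seed(1)
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60, 0, 1))

# Each of the 5 facilities should have 12 rows
table(h$FacilityID)

# No values are missing before any NAs are introduced
sum(is.na(h$X1))
```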
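I would like to perform the following:
1. Randomly select 2 IDs and set 3 of the X1 values within each to NA.
2. Randomly select 1 ID and set 4 of its X1 values to NA.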
Upvotes: 3
Views: 115
Reputation: 5788
Base R:
# Set seed for reproducibility:
set.seed(2020)
# Store no_nas, the number of nas to introduce per facility: no_nas => integer vector
no_nas <- c(rep(3, 2), 4)
# Store n, the number of facilities to sample: n => integer scalar
n <- length(no_nas)
# For each randomly sampled FacilityID, take its rows and set
# y randomly chosen X1 values to NA: facidsample => data.frame
facidsample <- do.call(rbind, Map(function(x, y) {
  i <- h[h$FacilityID == x, ]
  i$X1[sample(seq_len(nrow(i)), y)] <- NA_real_
  i
}, sample(unique(h$FacilityID), n), no_nas))
# Combine the untouched facilities with the modified ones: j => data.frame
j <- rbind(h[h$FacilityID %in% setdiff(h$FacilityID, facidsample$FacilityID), ],
           facidsample)
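One way to check a result like this is to count the NAs per facility. The helper name na_per_facility and the toy data below are assumptions for illustration, not part of the answer. Also note that in the code above the modified facilities are appended last by rbind, so j is no longer in the original row order.

```r
# Hypothetical helper (not from the answer): count NAs in X1 per FacilityID
na_per_facility <- function(d) tapply(is.na(d$X1), d$FacilityID, sum)

# Toy data in the same shape as h, with NAs introduced by hand
set.seed(2020)
toy <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60))
toy$X1[c(1, 2, 3)] <- NA         # 3 NAs in facility 1 (rows 1-12)
toy$X1[c(13, 14, 15, 16)] <- NA  # 4 NAs in facility 2 (rows 13-24)

# Facility 1 has 3 NAs, facility 2 has 4, the others have 0
na_per_facility(toy)
```

Applied to j, the same call should report two facilities with 3 NAs, one with 4, and the rest with 0.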
Upvotes: 2
Reputation: 9107
Here is a tidyverse solution. Use sample() to get the 3 IDs. Within each group, sample(row_number()) <= 4 randomly selects 4 rows.
library(tidyverse)
ids <- sample(unique(h$FacilityID), 3)
h %>%
  group_by(FacilityID) %>%
  mutate(
    X1 = case_when(
      FacilityID %in% ids[1:2] & sample(row_number()) <= 3 ~ NA_real_,
      FacilityID %in% ids[3] & sample(row_number()) <= 4 ~ NA_real_,
      TRUE ~ X1
    )
  )
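The row-selection trick can be illustrated in base R. Within a group of 12 rows, row_number() yields 1 to 12, so sample() returns a random permutation of those numbers and exactly 4 of them are at most 4. This is a minimal sketch; the seed and the variable names n_rows and pick are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, for illustration only
n_rows <- 12  # rows per facility in the sample data

# A random permutation of 1..12 compared against 4:
# TRUE marks the rows that would be set to NA
pick <- sample(seq_len(n_rows)) <= 4

length(pick)  # 12, one flag per row
sum(pick)     # 4, regardless of the seed
```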
Upvotes: 4
Reputation: 174238
It's not clear whether you want these two operations to be performed together or individually.
Individually you could do:
# Set 3 values from 2 IDs to NA
for(i in sample(unique(h$FacilityID), 2)) {
  h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
# Set 4 values from 1 ID to NA:
h$X1[sample(which(h$FacilityID == sample(unique(h$FacilityID), 1)), 4)] <- NA
If you want to perform both operations at once on the same data set you can do:
IDs <- sample(unique(h$FacilityID), 3)
for(i in IDs) {
  if(i == IDs[3])
    h$X1[sample(which(h$FacilityID == i), 4)] <- NA
  else
    h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
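The combined loop could also be wrapped in a function that takes the NA counts as a vector. This is only a sketch: the names add_missing and counts are hypothetical, and it indexes with sample.int() rather than sample(which(...), k), because sample(x, k) treats a length-one numeric x as 1:x.

```r
# Hypothetical helper: set counts[k] values of X1 to NA in the
# k-th randomly chosen facility (one facility per element of counts)
add_missing <- function(d, counts) {
  u <- unique(d$FacilityID)
  ids <- u[sample.int(length(u), length(counts))]
  for (k in seq_along(ids)) {
    rows <- which(d$FacilityID == ids[k])
    d$X1[rows[sample.int(length(rows), counts[k])]] <- NA
  }
  d
}

set.seed(123)  # arbitrary seed, for illustration only
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60))
h_na <- add_missing(h, c(3, 3, 4))

# NA counts per facility: two facilities with 3, one with 4, two with 0
tapply(is.na(h_na$X1), h_na$FacilityID, sum)
```

Returning a modified copy leaves the original h untouched, which makes it easy to compare the data before and after.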
Upvotes: 4