Reputation: 89
Suppose I have a table in the following format:
CowId DIM Type
1 13 Case
2 7 Case
3 3 Control
4 4 Control
5 9 Control
6 3 Control
7 5 Control
8 10 Control
9 1 Control
10 6 Control
11 7 Control
12 4 Control
I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.
Output from dput is appended:
structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
Upvotes: 0
Views: 124
Reputation: 4487
The part randomly could be tricky. Here is my approach:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
# calculate the min/max DIM
current_case <- df %>% filter(CowId == i_case_id)
expecting_DIM_min <- current_case$DIM - 3
expecting_DIM_max <- current_case$DIM + 3
# Pick with sample
possible_sample <- df %>%
filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
Picked == FALSE)
if (nrow(possible_sample) == 0) {
message("There is no possible sample for Case ID: ", i_case_id)
message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
} else {
max_sample <- nrow(possible_sample)
# Maximum pick - in this case OP ask for 1 - 1 matched
# pick_number <- max(1, max_sample / 2)
pick_number <- 1
sample <- possible_sample %>%
sample_n(size = 1)
df$Picked[df$CowId %in% sample$CowId] <- TRUE
df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
}
}
df %>% filter(Picked | Type == "Case")
#> CowId DIM Type Picked Case_ID
#> 1 1 13 Case FALSE -1
#> 2 2 7 Case FALSE -1
#> 3 8 10 Control TRUE 1
#> 4 10 6 Control TRUE 2
Updated: matching 1-1 only
Created on 2021-04-10 by the reprex package (v1.0.0)
Upvotes: 1
Reputation: 388947
A way in base R :
#Get the index where Type = 'Case'
inds <- df$Type == 'Case'
#Get all the values within -3-3 for each DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#select random rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine case and control data.
df[c(which(inds), result), ]
# CowId DIM Type
#1 1 13 Case
#2 2 7 Case
#5 5 9 Control
#10 10 6 Control
Upvotes: 3