Reputation: 89

Creating matched pairs based on condition

Suppose I have a table in the following format:

CowId    DIM    Type
1        13     Case
2         7     Case
3         3     Control
4         4     Control
5         9     Control
6         3     Control
7         5     Control
8        10     Control
9         1     Control
10        6     Control
11        7     Control
12        4     Control

I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.

Output from dput is appended:

structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")

Upvotes: 0

Answers (2)

Sinh Nguyen

Reputation: 4487

The part randomly could be tricky. Here is my approach:

For each case Id calculate the min/max DIM
Then randomly picked either 1 or half of available Control available to them
Update the Control picked with reference to CAse ID and excluded those rows from future pick.
Repeat this step till done for all Case
In case of no picked was available a message will popup.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
  5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
    ), class = "factor")), row.names = c(NA, -12L), class = "data.frame")


# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)

# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
  # calculate the min/max DIM
  current_case <- df %>% filter(CowId == i_case_id)  
  expecting_DIM_min <- current_case$DIM - 3
  expecting_DIM_max <- current_case$DIM + 3
  
  # Pick with sample
  possible_sample <- df %>%
    filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
      Picked == FALSE)
  if (nrow(possible_sample) == 0) {
    message("There is no possible sample for Case ID: ", i_case_id)
    message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
  } else {
    max_sample <- nrow(possible_sample)
    # Maximum pick - in this case OP ask for 1 - 1 matched
    # pick_number <- max(1, max_sample / 2)
    pick_number <- 1
    sample <- possible_sample %>%
      sample_n(size = 1)
    df$Picked[df$CowId %in% sample$CowId] <- TRUE
    df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
  }
}

Here is an output

df %>% filter(Picked | Type == "Case")
#>   CowId DIM    Type Picked Case_ID
#> 1     1  13    Case  FALSE      -1
#> 2     2   7    Case  FALSE      -1
#> 3     8  10 Control   TRUE       1
#> 4    10   6 Control   TRUE       2

Updated: matching 1-1 only

^{Created on 2021-04-10 by the reprex package (v1.0.0)}

Upvotes: 1

Ronak Shah

Reputation: 388947

A way in base R :

#Get the index where Type = 'Case'
inds <- df$Type == 'Case'
#Get all the values within -3-3 for each DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#select random rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine case and control data. 
df[c(which(inds), result), ]

#   CowId DIM    Type
#1      1  13    Case
#2      2   7    Case
#5      5   9 Control
#10    10   6 Control

Upvotes: 3

Creating matched pairs based on condition

Answers (2)

Here is an output

Related Questions