Christopher Dean
Christopher Dean

Reputation: 89

Creating matched pairs based on condition

Suppose I have a table in the following format:

CowId    DIM    Type
1        13     Case
2         7     Case
3         3     Control
4         4     Control
5         9     Control
6         3     Control
7         5     Control
8        10     Control
9         1     Control
10        6     Control
11        7     Control
12        4     Control

I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.

Output from dput is appended:

structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")

Upvotes: 0

Views: 124

Answers (2)

Sinh Nguyen
Sinh Nguyen

Reputation: 4487

The part randomly could be tricky. Here is my approach:

  • For each case Id calculate the min/max DIM
  • Then randomly picked either 1 or half of available Control available to them
  • Update the Control picked with reference to CAse ID and excluded those rows from future pick.
  • Repeat this step till done for all Case
  • In case of no picked was available a message will popup.
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
  5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
    ), class = "factor")), row.names = c(NA, -12L), class = "data.frame")


# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)

# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
  # calculate the min/max DIM
  current_case <- df %>% filter(CowId == i_case_id)  
  expecting_DIM_min <- current_case$DIM - 3
  expecting_DIM_max <- current_case$DIM + 3
  
  # Pick with sample
  possible_sample <- df %>%
    filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
      Picked == FALSE)
  if (nrow(possible_sample) == 0) {
    message("There is no possible sample for Case ID: ", i_case_id)
    message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
  } else {
    max_sample <- nrow(possible_sample)
    # Maximum pick - in this case OP ask for 1 - 1 matched
    # pick_number <- max(1, max_sample / 2)
    pick_number <- 1
    sample <- possible_sample %>%
      sample_n(size = 1)
    df$Picked[df$CowId %in% sample$CowId] <- TRUE
    df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
  }
}

Here is an output

df %>% filter(Picked | Type == "Case")
#>   CowId DIM    Type Picked Case_ID
#> 1     1  13    Case  FALSE      -1
#> 2     2   7    Case  FALSE      -1
#> 3     8  10 Control   TRUE       1
#> 4    10   6 Control   TRUE       2

Updated: matching 1-1 only

Created on 2021-04-10 by the reprex package (v1.0.0)

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388947

A way in base R :

#Get the index where Type = 'Case'
inds <- df$Type == 'Case'
#Get all the values within -3-3 for each DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#select random rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine case and control data. 
df[c(which(inds), result), ]

#   CowId DIM    Type
#1      1  13    Case
#2      2   7    Case
#5      5   9 Control
#10    10   6 Control

Upvotes: 3

Related Questions