zhiwei li

Reputation: 1711

How to obtain the specific grouping of cases and controls in R

I want to match 2 controls to every case under two conditions:

  1. the age difference should be within ±2;

  2. the income difference should be within ±2.

If a case has more than 2 eligible controls, I just need to select 2 of them at random. How do I then generate a new variable that indicates which case each control is matched to? For example, the two controls matched to Case1 are coded as group 1, and the two controls matched to Case2 are coded as group 2.

DATA

dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 
                     777, 888, 999, 1000), 
              age = c(10, 20, 44, 11, 12, 11, 8, 12,  11, 22, 21, 18, 21, 18), 
              income = c(35, 72, 11, 35, 37, 36, 33,  70, 34, 74, 70, 44, 76, 70), 
              group = c("case", "case", "case", "case", "control", "control", 
                        "control", "control", "control", "control", "control", 
                        "control", "control", "control")), 
         row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))

EXPECTED OUTPUT

id age income group index
1 10 35 case 1
2 20 72 case 2
3 44 11 case 3
4 11 35 case 4
111 12 37 control 1
222 11 36 control 1
333 8 33 control 4
555 11 34 control 4
777 21 70 control 2
1000 18 70 control 2

This is similar to my previous question, but I want the output to have an extra variable called index that identifies which controls are matched to which case. If a case and a control have the same index, that control is matched to that case.

The question is how to create the index, preferably with an approach based on the previous question.

Upvotes: 2

Views: 402

Answers (1)

Zhiqiang Wang

Reputation: 6769

This is based on the accepted answer to your previous post by @AnilGoyal:

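library(dplyr, warn.conflicts = FALSE)

# Add an index column (0 = unmatched), then split by group into the
# data frames `case` and `control`, created in the global environment
dat %>% mutate(index = 0) %>% 
  split(.$group) %>%
  list2env(envir = .GlobalEnv)

set.seed(12345)
for (i in seq_len(nrow(case))) {
  # Unmatched controls within ±2 of case i on both age and income
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) & 
               between(control$income, case$income[i] - 2, case$income[i] + 2) & 
               control$index == 0)
  # Pick at most 2 at random; indexing with sample.int() avoids sample()
  # treating a single candidate position n as 1:n
  control$index[x[sample.int(length(x), min(2, length(x)))]] <- i
  case$index[i] <- i
}

# Keep all cases plus the controls that were matched
matched <- case %>% rbind(control) %>% filter(index > 0)
matched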

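Please note: some cases have more than 2 controls meeting the criteria. For those, 2 controls are selected at random, so the matched controls can differ from your expected output. Each control is assigned to at most one case, and a case with no eligible control (case 3 here) is still kept with its own index.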
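One detail of base R's `sample()` matters whenever candidates are drawn from a vector of row positions: if that vector happens to contain a single positive number `n`, `sample()` draws from `1:n` instead of returning `n`. The sketch below illustrates this with a hypothetical helper, `safe_sample()` (the name and the example values are assumptions, not part of the original answer), which indexes through `sample.int()` instead.

```r
# Hypothetical helper: draw up to n elements from x, always treating x as a
# set of values (never as an upper bound)
safe_sample <- function(x, n) {
  x[sample.int(length(x), min(n, length(x)))]
}

set.seed(1)
one_candidate <- 7                        # a single eligible row position

naive <- sample(one_candidate, 1)         # draws from 1:7, so it may not be 7
safe  <- safe_sample(one_candidate, 2)    # always returns 7

several <- safe_sample(c(3, 9, 12), 2)    # two distinct values from the vector
none    <- safe_sample(integer(0), 2)     # empty result: nothing to assign
```

With the data in this question no case ends up with exactly one eligible control, so the issue does not show here, but it can appear with other data.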

> matched 
# A tibble: 10 × 5
      id   age income group   index
   <dbl> <dbl>  <dbl> <chr>   <dbl>
 1     1    10     35 case        1
 2     2    20     72 case        2
 3     3    44     11 case        3
 4     4    11     35 case        4
 5   111    12     37 control     4
 6   222    11     36 control     1
 7   333     8     33 control     1
 8   555    11     34 control     4
 9   777    21     70 control     2
10  1000    18     70 control     2
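As a quick sanity check, the matched rows can be paired back to their cases through the shared index. This is a minimal base R sketch, assuming the rows are exactly those printed above (retyped here as a plain data frame so the snippet runs on its own); the names `pairs`, `all_within` and `n_controls` are illustrative.

```r
# Matched rows copied from the printed output
matched <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 555, 777, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 11, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 34, 70, 70),
  group  = c(rep("case", 4), rep("control", 6)),
  index  = c(1, 2, 3, 4, 4, 1, 1, 4, 2, 2)
)

cases    <- matched[matched$group == "case", ]
controls <- matched[matched$group == "control", ]

# Pair every control with its case via the shared index
pairs <- merge(controls, cases, by = "index",
               suffixes = c("_control", "_case"))

# Both differences must lie within +/- 2 for every pair
all_within <- all(abs(pairs$age_control - pairs$age_case) <= 2 &
                  abs(pairs$income_control - pairs$income_case) <= 2)

# Number of controls per case; a case without any match shows 0
n_controls <- table(factor(controls$index, levels = cases$index))
```

Here `all_within` is `TRUE`, and `n_controls` shows two controls for cases 1, 2 and 4 and none for case 3.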

Upvotes: 2
