How to obtain the specific grouping of cases and controls in R

Question

I want to match 2 controls for every case with two conditions:

the age difference should be within ±2;
the income difference should be within ±2.

If more than 2 controls are eligible for a case, I just need to select 2 of them at random. Then, how do I generate a new variable that indicates which case each control is matched to? For example, Control1 and Control2 matched to Case1 are encoded as group 1, and Control1 and Control2 matched to Case2 are encoded as group 2.

DATA

dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 
                     777, 888, 999, 1000), 
              age = c(10, 20, 44, 11, 12, 11, 8, 12,  11, 22, 21, 18, 21, 18), 
              income = c(35, 72, 11, 35, 37, 36, 33,  70, 34, 74, 70, 44, 76, 70), 
              group = c("case", "case", "case", "case", "control", "control", 
                        "control", "control", "control", "control", "control", 
                        "control", "control", "control")), 
         row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))

EXPECTED OUTPUT

id	age	income	group	index
1	10	35	case	1
2	20	72	case	2
3	44	11	case	3
4	11	35	case	4
111	12	37	control	1
222	11	36	control	1
333	8	33	control	4
555	11	34	control	4
777	21	70	control	2
1000	18	70	control	2

This is similar to my previous question, but I want the output to have an extra variable called index that identifies the specific controls matched to each case. If a case and a control have the same index, that control is matched to that case.

My question is how to create the index, preferably with an approach based on the previous question.

Zhiqiang Wang · Accepted Answer

This is based on the accepted answer to your previous post by @AnilGoyal:

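library(dplyr, warn.conflicts = FALSE)

# Add index = 0 ("unmatched") and split by group; list2env() then creates
# the data frames `case` and `control` in the global environment.
dat %>% mutate(index = 0) %>% 
  split(.$group) %>%
  list2env(envir = .GlobalEnv)

set.seed(12345)
for (i in seq_len(nrow(case))) {
  # unmatched controls within ±2 of case i on both age and income
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) & 
               between(control$income, case$income[i] - 2, case$income[i] + 2) & 
               control$index == 0)
  # pick up to 2 of them; sampling positions rather than values avoids
  # sample(x, 1) being read as sample(1:x, 1) when length(x) == 1
  control$index[x[sample.int(length(x), min(2, length(x)))]] <- i
  case$index[i] <- i
}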

matched <- case %>% rbind(control) %>% filter(index >0)
matched

Please note: some cases have more than 2 controls meeting the criteria; for those, 2 controls are selected at random. A control that has already been matched is not reused for a later case.
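
The loop above can also be wrapped in a function so that nothing is written to the global environment. The following is a minimal base-R sketch, not part of the original answer: `match_controls`, `n_controls`, and `tol` are hypothetical names introduced here for illustration. It samples positions with `sample.int()` because `sample(x, 1)` treats a length-one numeric `x` as `1:x`, which could pick the wrong row when exactly one candidate remains.

```r
# Hypothetical helper: greedy matching of up to `n_controls` controls per case,
# within +/- `tol` on both age and income. Each control is used at most once.
match_controls <- function(dat, n_controls = 2, tol = 2) {
  case    <- dat[dat$group == "case", , drop = FALSE]
  control <- dat[dat$group == "control", , drop = FALSE]
  case$index    <- seq_len(nrow(case))  # each case gets its own group number
  control$index <- 0L                   # 0 means "not matched yet"

  for (i in seq_len(nrow(case))) {
    # unmatched controls that satisfy both conditions for case i
    x <- which(abs(control$age - case$age[i]) <= tol &
                 abs(control$income - case$income[i]) <= tol &
                 control$index == 0L)
    k <- min(n_controls, length(x))
    # sample positions, not values: safe when length(x) is 0 or 1
    control$index[x[sample.int(length(x), k)]] <- i
  }

  out <- rbind(case, control[control$index > 0L, , drop = FALSE])
  rownames(out) <- NULL
  out
}

# Same data as in the question, as a plain data frame
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

set.seed(12345)
matched <- match_controls(dat)
matched
```

Because the matching is greedy and cases are processed in row order, a later case can end up with fewer than `n_controls` controls if earlier cases have already taken its candidates.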

> matched 
# A tibble: 10 × 5
      id   age income group   index
   <dbl> <dbl>  <dbl> <chr>   <dbl>
 1     1    10     35 case        1
 2     2    20     72 case        2
 3     3    44     11 case        3
 4     4    11     35 case        4
 5   111    12     37 control     4
 6   222    11     36 control     1
 7   333     8     33 control     1
 8   555    11     34 control     4
 9   777    21     70 control     2
10  1000    18     70 control     2
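
One way to confirm that a result like this respects both conditions is to join each control back to the case sharing its index and check the differences. The sketch below uses base R only, with the printed rows retyped by hand as `matched`; `cases`, `controls`, and `check` are hypothetical names introduced here.

```r
# The matched rows shown above, retyped as a plain data frame
matched <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 555, 777, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 11, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 34, 70, 70),
  group  = rep(c("case", "control"), times = c(4, 6)),
  index  = c(1, 2, 3, 4, 4, 1, 1, 4, 2, 2)
)

cases    <- matched[matched$group == "case", ]
controls <- matched[matched$group == "control", ]

# Pair every control with the case that shares its index
check <- merge(controls, cases, by = "index", suffixes = c("_control", "_case"))
check$age_ok    <- abs(check$age_control - check$age_case) <= 2
check$income_ok <- abs(check$income_control - check$income_case) <= 2

check[, c("index", "id_case", "id_control", "age_ok", "income_ok")]

# Number of controls per case; a case with no eligible control shows 0
table(factor(controls$index, levels = cases$index))
```

Here every row of `check` passes both tests, and case 3 (age 44, income 11) has no controls because none fall within ±2 on both variables.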
