Reputation: 705
I have a data frame with 2,130 observations: 130 patients with stroke and 2,000 with heart attack (MI) in this format.
Index Age Sex Stroke MI
1 42 M FALSE TRUE
2 76 M FALSE TRUE
3 55 F FALSE TRUE
4 80 M TRUE FALSE
5 68 F FALSE TRUE
str(Match)
'data.frame': 2130 obs. of 5 variables:
$ Index : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender: Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 2 1 ...
$ Age : num 45.8 44.1 67.7 37.4 46.7 72 21.4 50.8 35.8 47.2 ...
$ Stroke: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ MI : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
For each row where Stroke=TRUE [the test row] I need a single row where MI=TRUE [the matched row].
The following 2 conditions must be met:
Sex in [matched row] = Sex in [test row]
Age in [matched row] equal to or within +/- 3 years of Age in [test row]
Once a row has been selected as a match, it cannot be used again. Since there are 130 strokes to be matched, the output should contain 130 heart attacks. There are sufficient heart attacks to select from. In the unlikely event that 130 matches cannot be found, match as many as possible.
I am just starting to learn R and have no programming background. My attempts so far have been limited to finding rows with 'which' and 'subset', e.g. all male stroke patients. Any guidance on where to start? Age and sex matching is common in medical research, so others will benefit from any help.
Upvotes: 0
Views: 214
Reputation: 263481
Your data is not adequate for any testing since none of the values have an acceptable match. I'm only offering this on the off chance that you are not in the analysis stage but are rather in the design stage where you need to select appropriate controls because there is an additional cost associated with enrollment.
str(dfrm4)
'data.frame': 15 obs. of 5 variables:
$ Index : int 1 2 3 4 5
$ Age : int 42 76 55 80 68
$ Sex : Factor w/ 2 levels "F","M": 2 2 1 2 1
$ Stroke: logi FALSE FALSE FALSE TRUE FALSE
$ MI : logi TRUE TRUE TRUE FALSE TRUE
#----------------
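# apply() coerces each row to a character vector, hence the conversions below
res <- apply(dfrm4[dfrm4$Stroke, ], 1, function(m) {
  possm <- dfrm4[!dfrm4$Stroke &
                 as.character(dfrm4$Sex) == m['Sex'] &
                 dfrm4$Index != as.numeric(m['Index']) &
                 dfrm4$Age >= as.numeric(m['Age']) - 3 &
                 dfrm4$Age <= as.numeric(m['Age']) + 3, ]
  N <- NROW(possm)
  # sample(1:N, 1) misbehaves when N is 0 because 1:0 is c(1, 0);
  # seq_len(N) with size min(N, 1) returns zero rows in that case
  possm[sample(seq_len(N), min(N, 1L)), ]
})
res  # print the list of candidate matches, one element per stroke row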
# --------------
$`4`
[1] Index Age Sex Stroke MI
<0 rows> (or 0-length row.names)
That result could be turned into an R data.frame with do.call(rbind, res).
There's only one stroke and no matches within 3 years of age. I did test it on a larger sample with some available matches and it did succeed, but please do heed the methodological advice and avoid matching if this is in the analysis stage. Using matching unnecessarily is a waste of valuable data and is often statistical malpractice. If the advancement of medical science might depend on this, it would be best to get experienced statistical consultation.
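The apply() call above draws each match independently, so the same MI row could be picked for two different strokes. If the no-reuse requirement from the question matters, one option is a simple greedy loop. What follows is only a minimal sketch: match_controls, its tol argument, and the toy data frame are hypothetical names made up for illustration, and the toy data is simulated, not the poster's. Greedy selection depends on row order and is not guaranteed to find the largest possible set of pairs.

```r
# Hypothetical helper (not from the original post): greedy 1:1 matching of
# stroke rows to MI rows on sex and age, never reusing a control row.
match_controls <- function(df, tol = 3) {
  cases    <- df[df$Stroke, ]
  controls <- df[df$MI & !df$Stroke, ]
  used     <- rep(FALSE, nrow(controls))    # TRUE once a control is taken
  pairs    <- vector("list", nrow(cases))

  for (i in seq_len(nrow(cases))) {
    ok <- !used &
      as.character(controls$Sex) == as.character(cases$Sex[i]) &
      abs(controls$Age - cases$Age[i]) <= tol
    idx <- which(ok)
    if (length(idx) == 0) next              # no eligible control: skip this case
    pick <- idx[sample.int(length(idx), 1)] # random choice among eligible rows
    used[pick] <- TRUE
    pairs[[i]] <- data.frame(case_index    = cases$Index[i],
                             control_index = controls$Index[pick])
  }
  # Drop unmatched cases, then stack the pairs (NULL if nothing matched)
  do.call(rbind, Filter(Negate(is.null), pairs))
}

# Simulated toy data (an assumption, only to exercise the function)
set.seed(1)
toy <- data.frame(
  Index  = 1:60,
  Age    = round(runif(60, 40, 80), 1),
  Sex    = factor(sample(c("F", "M"), 60, replace = TRUE)),
  Stroke = rep(c(TRUE, FALSE), c(10, 50))
)
toy$MI <- !toy$Stroke

matched <- match_controls(toy)
matched
```

The result has one row per matched stroke, so the matched MI rows can be pulled out with toy[match(matched$control_index, toy$Index), ].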
Upvotes: 3