Stewart Wiseman

Reputation: 705

Using conditions in rows to find other rows in a data frame

I have a data frame with 2,130 observations: 130 patients with stroke and 2,000 with heart attack (MI), in this format:

Index   Age     Sex Stroke  MI
1       42      M   FALSE   TRUE
2       76      M   FALSE   TRUE
3       55      F   FALSE   TRUE
4       80      M   TRUE    FALSE
5       68      F   FALSE   TRUE

str(Match)

'data.frame':   2130 obs. of  5 variables:

 $ Index : int  1 2 3 4 5 6 7 8 9 10 ...  
 $ Gender: Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 2 1 ...  
 $ Age   : num  45.8 44.1 67.7 37.4 46.7 72 21.4 50.8 35.8 47.2 ...  
 $ Stroke: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...  
 $ MI    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...  

For each row where Stroke=TRUE [the test row] I need a single row where MI=TRUE [the matched row].

The following 2 conditions must be met:

Sex in [matched row] = Sex in [test row]
Age in [matched row] equal to or within +/- 3 years of Age in [test row]

Once a row has been selected as a match, it cannot be used again. Since there are 130 strokes to be matched, the output should contain 130 heart attacks. There are sufficient heart attacks to select from. In the unlikely event that 130 cannot be found, match as many as possible.

I am just starting to learn R and have no programming background. My attempts have been limited to finding rows with 'which' and 'subset', e.g. all male stroke patients! Any guidance on where to start? Age and sex matching is common in medical research, and others will benefit from any help.

Upvotes: 0

Views: 214

Answers (2)

zx8754

Reputation: 56259

Try using the matchControls function from the e1071 package.
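A minimal sketch of what such a call might look like, assuming a single grouping factor (here the made-up column `Group`) instead of the two logical columns in the question. The toy data are hypothetical, and the block is guarded so it does nothing if e1071 (or cluster, which matchControls relies on) is not installed. As far as I recall, matchControls picks the nearest control by a dissimilarity measure and does not enforce an exact sex match or a +/- 3 year limit, so the pairs should be checked afterwards.

```r
# Hypothetical toy data (not the asker's data): 3 strokes, 7 MIs
toy <- data.frame(
  Age   = c(62, 70, 55, 61, 63, 71, 69, 54, 56, 40),
  Sex   = factor(c("M", "F", "F", "M", "M", "F", "F", "F", "F", "M")),
  Group = factor(c(rep("Stroke", 3), rep("MI", 7)))
)

have_pkgs <- requireNamespace("e1071", quietly = TRUE) &&
             requireNamespace("cluster", quietly = TRUE)

if (have_pkgs) {
  # contlabel identifies the control level; replace = FALSE (the default)
  # means each control is used at most once
  m <- e1071::matchControls(Group ~ Sex + Age, data = toy, contlabel = "MI")

  # m$cases and m$controls are row names; line them up to inspect the pairs
  pairs <- data.frame(
    case        = m$cases,
    control     = m$controls,
    case_age    = toy[m$cases, "Age"],
    control_age = toy[m$controls, "Age"],
    same_sex    = toy[m$cases, "Sex"] == toy[m$controls, "Sex"]
  )
  print(pairs)
}
```

Any pair where `same_sex` is FALSE or the ages differ by more than 3 years would need to be dropped or re-matched by hand.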

Upvotes: 0

IRTFM

Reputation: 263481

Your data is not adequate for any testing since none of the values have an acceptable match. I'm only offering this on the off chance that you are not in the analysis stage but are rather in the design stage where you need to select appropriate controls because there is an additional cost associated with enrollment.

 str(dfrm4)
'data.frame':   5 obs. of  5 variables:
 $ Index : int  1 2 3 4 5
 $ Age   : int  42 76 55 80 68
 $ Sex   : Factor w/ 2 levels "F","M": 2 2 1 2 1
 $ Stroke: logi  FALSE FALSE FALSE TRUE FALSE
 $ MI    : logi  TRUE TRUE TRUE FALSE TRUE
#----------------

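  res <- apply(dfrm4[ dfrm4$Stroke, ] , 1, function(m) {
      # candidate controls: non-stroke rows of the same sex, within +/- 3 years
      possm <- dfrm4[!dfrm4$Stroke & 
                     as.character(dfrm4$Sex)==m['Sex'] & 
                     as.character(dfrm4$Index) != m['Index'] &
                     dfrm4$Age >= as.numeric(m['Age']) -3 & 
                     dfrm4$Age <= as.numeric(m['Age']) +3, ];  
       N <- NROW(possm)
       # seq_len(N) is empty when N is 0, so nothing is sampled in that case
       # (sample(1:N, 1) would draw from c(1, 0) and could return an NA row)
       return(possm[ sample(seq_len(N), min(N, 1)), ] ) })
  res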

# --------------
$`4`
[1] Index  Age    Sex    Stroke MI    
<0 rows> (or 0-length row.names)

If the result is assigned to res, it can be turned into an R data.frame with do.call(rbind, res).

There's only one stroke and no matches within 3 years of age. I did test it on a larger sample with some available matches, and it did succeed, but please do heed the methodological advice and avoid matching if this is in the analysis stage. Using matching unnecessarily is a waste of valuable data and is often statistical malpractice. If the advancement of medical science might depend on this, it would be best to get experienced statistical consultation.
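The apply() code above samples each match independently, so the same MI row could be picked for two strokes. A minimal base R sketch of the "no reuse" requirement from the question follows. The helper `match_controls`, its `tol` argument, and the toy data are all hypothetical and not part of the original answer. It is a simple greedy pass (closest available age first), so it is not guaranteed to find the largest possible number of pairs.

```r
# Hypothetical helper: greedy 1:1 matching on sex and age (+/- tol years),
# using each control at most once.
match_controls <- function(df, tol = 3) {
  case_rows <- which(df$Stroke)
  available <- df$MI & !df$Stroke            # controls not yet used
  matched   <- rep(NA_integer_, length(case_rows))
  for (k in seq_along(case_rows)) {
    i <- case_rows[k]
    cand <- which(available &
                  as.character(df$Sex) == as.character(df$Sex[i]) &
                  abs(df$Age - df$Age[i]) <= tol)
    if (length(cand) > 0) {
      # closest in age; which.min takes the first row on ties
      j <- cand[which.min(abs(df$Age[cand] - df$Age[i]))]
      matched[k]   <- j
      available[j] <- FALSE                  # never reuse a control
    }
  }
  # unmatched cases get NA in the control column
  data.frame(case = df$Index[case_rows], control = df$Index[matched])
}

# Made-up data in the question's layout: 3 strokes, 5 MIs
toy <- data.frame(
  Index  = 1:8,
  Age    = c(60, 62, 80, 61, 63, 59, 62, 40),
  Sex    = factor(c("M", "M", "F", "M", "M", "F", "M", "F")),
  Stroke = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)
toy$MI <- !toy$Stroke

pairs <- match_controls(toy)
pairs

# the matched MI rows themselves
controls <- toy[toy$Index %in% na.omit(pairs$control), ]
```

With these toy values, strokes 1 and 2 are paired with rows 4 and 7, and stroke 3 (an 80-year-old woman) is left unmatched because no female MI row is within 3 years.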

Upvotes: 3
