Johannes Wiesner

Reputation: 1307

Using MatchIt on a multi-site data set does not work

I would like to use the MatchIt package to match patient and control groups in a multi-site data set. Matching should use both sex and age and should be done without replacement. The plan is to iterate over the sites, do the matching within each site, and concatenate the matched data frames afterward to obtain a multi-site data frame with matched samples.

This is what my dataset looks like (only the first seven rows, to give you a quick intuition):

> multi_site_df
    age   site   group group_boolean sex_boolean
1    53 site_B patient             1           0
2    30 site_B patient             1           0
3    27 site_B control             0           0
4    32 site_B patient             1           1
5    63 site_B control             0           0
6    34 site_B control             0           0
7    34 site_B patient             1           0
...

One thing to note about this multi-site data set is that some sites have more patients than controls, while in other sites it's the other way around:

  site   group       n
1 site_A control    44 
2 site_A patient    44 
3 site_B control   100
4 site_B patient    79
5 site_C control    26
6 site_C patient    32
7 site_D control    25
8 site_D patient    33

I started to write some code that uses MatchIt for each subset of the multi-site data set:

# get unique sites 
sites <- unique(multi_site_df$site)

# iterate over sites and do the matching for each site
for (site in sites) {
  
  site_df <- multi_site_df[which(multi_site_df$site == site),]
  
  m.out <- matchit(formula=as.formula('group_boolean ~ sex_boolean + age'),
                   data=site_df,
                   method='nearest')
  
  m.out
  site_df_matched <- get_matches(m.out,site_df)

}

but it gives me these warnings:

Warning messages:
1: In matchit2nearest(c(180 = 0L, 181 = 1L, 182 = 0L, 183 = 1L, :
  Fewer control than treated units and matching without replacement. Not all treated units will receive a match. Treated units will be matched in the order specified by m.order: largest
2: In matchit2nearest(c(238 = 0L, 239 = 0L, 240 = 0L, 241 = 0L, :
  Fewer control than treated units and matching without replacement. Not all treated units will receive a match. Treated units will be matched in the order specified by m.order: largest

This warning results in matched data frames that contain only NA values. It seems to be related to this StackOverflow post and to the fact that some sites contain more patients than controls and vice versa. Is there any workaround for this besides creating a new outcome variable notY for every site? For me, it is not important whether patients or controls are discarded. The logic would be: "Find the best match from the majority group for every unit in the minority group and discard all leftover units."

Upvotes: 0

Views: 768

Answers (1)

Noah

Reputation: 4424

Any workaround would be more complicated than creating a new treatment variable. You can write simple enough code to check whether control or treated units are more plentiful in each site and switch the values of the treatment variable based on that, then run MatchIt as normal. Here is how you might do that:

if (sum(site_df$group_boolean == 1) > sum(site_df$group_boolean == 0)) {
  site_df$group_boolean <- 1 - site_df$group_boolean
}
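Combined with the per-site loop from the question, the whole procedure might look like the sketch below. It is only an illustration under some assumptions: the helper name flip_if_needed and the columns treat and flipped are hypothetical, the toy data is made up, and match.data() is used to extract the matched rows. The original group_boolean column is left untouched so you can still tell patients from controls afterward.

```r
# Hypothetical helper: add a column `treat` in which the smaller group is
# always coded 1, i.e. the group that matchit() will find matches for.
flip_if_needed <- function(site_df) {
  n_patients <- sum(site_df$group_boolean == 1)
  n_controls <- sum(site_df$group_boolean == 0)
  flip <- n_patients > n_controls
  site_df$flipped <- flip
  site_df$treat <- if (flip) 1 - site_df$group_boolean else site_df$group_boolean
  site_df
}

# Toy data (made up): site_X has more patients than controls, site_Y the reverse.
set.seed(1)
toy_df <- data.frame(
  site          = rep(c("site_X", "site_Y"), times = c(50, 50)),
  group_boolean = c(rep(c(1, 0), times = c(30, 20)),
                    rep(c(1, 0), times = c(15, 35))),
  sex_boolean   = rbinom(100, 1, 0.5),
  age           = round(runif(100, 20, 70))
)

# Split by site and recode the treatment within each site.
site_list <- lapply(split(toy_df, toy_df$site), flip_if_needed)

# The matching step only runs when MatchIt is installed.
if (requireNamespace("MatchIt", quietly = TRUE)) {
  matched_list <- lapply(site_list, function(site_df) {
    m.out <- MatchIt::matchit(treat ~ sex_boolean + age,
                              data = site_df,
                              method = "nearest")
    MatchIt::match.data(m.out)
  })
  # Stack the per-site matched samples into one multi-site data frame.
  matched_df <- do.call(rbind, matched_list)
}
```

Keeping the flipped column makes it explicit, site by site, which group played the role of the focal group in the matching.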

That is as simple as it could be. Note that in MatchIt 4.0.0 (not yet on CRAN), there is a new estimand argument that you can toggle between "ATT" (the default) and "ATC", the latter of which switches the roles of the treatment and control groups in the matching as you desire. But you still have to tell matchit() which one you want, so performing that check is necessary. It's important that matchit() fails when you give it fewer control than treated units because the meaning of the estimated effect completely changes when the focal group changes. The current behavior forces users to consider whether they are making the right choice.
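With a MatchIt version that has the estimand argument, the same check could select the estimand instead of recoding the treatment. The following is a rough sketch, not a definitive recipe: choose_estimand is a hypothetical helper name and the single-site toy data is made up.

```r
# Hypothetical helper: choose the estimand so that the smaller group is the
# focal group (ties fall back to the default, "ATT").
choose_estimand <- function(treat) {
  if (sum(treat == 1) > sum(treat == 0)) "ATC" else "ATT"
}

# Toy single-site data (made up): 30 patients, 20 controls.
set.seed(1)
site_df <- data.frame(
  group_boolean = rep(c(1, 0), times = c(30, 20)),
  sex_boolean   = rbinom(50, 1, 0.5),
  age           = round(runif(50, 20, 70))
)

estimand <- choose_estimand(site_df$group_boolean)

# The matching step only runs when MatchIt is installed
# (and needs a version that supports `estimand`).
if (requireNamespace("MatchIt", quietly = TRUE)) {
  m.out <- MatchIt::matchit(group_boolean ~ sex_boolean + age,
                            data = site_df,
                            method = "nearest",
                            estimand = estimand)
  site_df_matched <- MatchIt::match.data(m.out)
}
```

Here the treatment coding stays as it is, and only the role of the focal group changes per site.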

Also, I hope you know that you are not matching on sex and age but rather on a propensity score estimated using a logistic regression of group on sex and age. Paired units may not actually be close to each other on sex and age.
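If you want pairs that are close on the covariates themselves, one option to consider is Mahalanobis distance matching combined with exact matching on sex. This is a sketch of that alternative, not what the code in the question does; the toy data is made up, and the subclass column used to inspect the pairs is assumed to be present in the match.data() output.

```r
# Toy single-site data (made up): 15 patients, 35 controls.
set.seed(1)
site_df <- data.frame(
  group_boolean = rep(c(1, 0), times = c(15, 35)),
  sex_boolean   = rbinom(50, 1, 0.5),
  age           = round(runif(50, 20, 70))
)

# The matching step only runs when MatchIt is installed.
if (requireNamespace("MatchIt", quietly = TRUE)) {
  # Pair units on the covariates directly rather than on a propensity score,
  # and require paired units to have the same sex.
  m.out <- MatchIt::matchit(group_boolean ~ sex_boolean + age,
                            data = site_df,
                            method = "nearest",
                            distance = "mahalanobis",
                            exact = "sex_boolean")
  site_df_matched <- MatchIt::match.data(m.out)

  # Inspect the pairs: rows sharing a `subclass` value were matched together.
  print(head(site_df_matched[order(site_df_matched$subclass), ]))
}
```

Treated units that have no remaining control of the same sex are left unmatched, so check how many units are dropped in each site.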

Upvotes: 0
