Reputation: 1307
I would like to use the MatchIt package to match patient and control groups in a multi-site data set. Both sex and age should be used for matching, and the matching should be done without replacement. The idea is to iterate over the sites, do the matching within each site, and concatenate the matched data frames afterward to obtain a multi-site data frame with matched samples.
This is what my dataset looks like (only the first seven rows, to give a quick impression):
> multi_site_df
age site group group_boolean sex_boolean
1 53 site_B patient 1 0
2 30 site_B patient 1 0
3 27 site_B control 0 0
4 32 site_B patient 1 1
5 63 site_B control 0 0
6 34 site_B control 0 0
7 34 site_B patient 1 0
...
One complication is that some sites have more patients than controls, while in other sites it is the other way around:
site group n
1 site_A control 44
2 site_A patient 44
3 site_B control 100
4 site_B patient 79
5 site_C control 26
6 site_C patient 32
7 site_D control 25
8 site_D patient 33
I started to write some code that runs MatchIt on each site's subset of the multi-site data set:
# get unique sites
sites <- unique(multi_site_df$site)
# iterate over sites and do the matching for each site
for (site in sites) {
  site_df <- multi_site_df[which(multi_site_df$site == site), ]
  m.out <- matchit(formula = as.formula('group_boolean ~ sex_boolean + age'),
                   data = site_df,
                   method = 'nearest')
  m.out
  site_df_matched <- get_matches(m.out, site_df)
}
but it gives me these warnings:
Warning messages:
1: In matchit2nearest(c(`180` = 0L, `181` = 1L, `182` = 0L, `183` = 1L, :
  Fewer control than treated units and matching without replacement. Not all treated units will receive a match. Treated units will be matched in the order specified by m.order: largest
2: In matchit2nearest(c(`238` = 0L, `239` = 0L, `240` = 0L, `241` = 0L, :
  Fewer control than treated units and matching without replacement. Not all treated units will receive a match. Treated units will be matched in the order specified by m.order: largest
This warning results in matched data frames that contain only NA values. It seems to be related to this StackOverflow post and to the fact that some sites contain more patients than controls and vice versa. Is there any workaround for this besides creating a new outcome variable notY for every site? For me, it does not matter whether patients or controls are discarded. The logic would be: "Find the best match from the majority group for every unit in the minority group and discard all leftover units."
Upvotes: 0
Views: 768
Reputation: 4424
Any workaround would be more complicated than creating a new treatment variable. You can write simple enough code to check whether control or treated units are more plentiful in each site and switch the values of the treatment variable based on that, then run MatchIt as normal. Here is how you might do that:
# flip the treatment coding when treated units outnumber controls
if (sum(site_df$group_boolean == 1) > sum(site_df$group_boolean == 0)) {
  site_df$group_boolean <- 1 - site_df$group_boolean
}
That is as simple as it could be. Note that in MatchIt 4.0.0 (not yet on CRAN), there is a new estimand argument that you can toggle between "ATT" (the default) and "ATC", the latter of which switches the roles of the treatment and control groups in the matching as you desire. But you still have to tell matchit() which one you want, so performing that check is necessary. It's important that matchit() fails when you give it fewer control than treated units because the meaning of the estimated effect completely changes when the focal group changes. The current behavior forces users to consider whether they are making the right choice.
Also, I hope you know that you are not matching on sex and age directly but rather on a propensity score estimated using a logistic regression of group on sex and age. Paired units may not actually be close to each other on sex and age.
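If pairs should be close on the covariates themselves, one option is to match on a covariate distance instead of the propensity score. The sketch below is not from the original answer. It assumes MatchIt 4.0.0 or later and uses invented toy data, requiring exact agreement on sex through the exact argument and using Mahalanobis distance for nearest-neighbor matching.

```r
library(MatchIt)  # assumes MatchIt >= 4.0.0

# Toy single-site data; all values are invented for illustration
set.seed(2)
site_df <- data.frame(
  age = round(runif(60, 20, 70)),
  sex_boolean = rep(c(0, 1), times = 30),
  group_boolean = rep(c(0, 1), times = c(40, 20))
)

# Nearest-neighbor matching on the covariates rather than a propensity score:
# pairs must agree exactly on sex and are chosen by Mahalanobis distance
m.out <- matchit(group_boolean ~ sex_boolean + age,
                 data = site_df, method = "nearest",
                 distance = "mahalanobis",
                 exact = ~ sex_boolean,
                 replace = FALSE)

matched <- match.data(m.out, data = site_df)

# Inspect how far apart the paired units are in age
age_gap <- tapply(matched$age, matched$subclass, function(x) diff(range(x)))
summary(age_gap)
```

Checking the age gaps within pairs, or calling summary(m.out), shows whether the matched units really are similar on sex and age.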
Upvotes: 0