LZG

Reputation: 59

R package MatchIt with factor variables

I'm using the R package MatchIt to calculate propensity score weights for use in a straightforward survival analysis, and I'm seeing very different behavior depending on whether some of the covariates entering the propensity score calculation are factors or numeric.

An example: simple code with 3 variables, one numeric (tumor size) and two that could be factors (tumor stage, smoking habit). The treatment variable is a factor indicating the type of surgery.

Example 1: with stage as factor and smoking habit as integer,

> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"   "factor"  "integer" 

I calculate the propensity scores with the following code and extract the weights:

library(MatchIt)

# Keep the ID, the treatment indicator, and the confounders
data.for.ps = surg.data[,c('record_id','surgeries_combined_n', confounders)]

# Full matching on a logistic-regression propensity score
match.it.1 <- matchit(as.formula(paste0('surgeries_combined_n ~',paste0(confounders, collapse='+'))), 
   data=data.for.ps, method='full', distance='logit')
match.it.1$nn
m.data = match.data(match.it.1)
m.data$weights = match.it.1$weights

No big problems. The corresponding weighted survival analysis gives the following result (what "blue" and "red" mean does not matter here):

Plot1: stage=factor; smoking=integer

Example 2 is exactly the same, but with tumor stage now numeric:

> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"  "numeric"  "integer" 

With exactly the same code for matching and exactly the same code for the survival analysis, the result is the following:

Plot2: stage=numeric; smoking=integer

Not very different, but different.

Example 3 uses exactly the same code, but with both tumor stage and smoking habit as factors:

> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"   "factor"   "factor" 

The result, using exactly the same code, is the following:

Plot3: stage=factor; smoking=factor

Totally different.

Now, there is no reason why either of the two potential factors should be numeric: both can be factors, but the results are unquestionably different. Can anybody help me understand:

  1. Why does this happen? I don't think it's a coding problem, but rather a matter of understanding which class is the correct one to pass to matchit.
  2. Which is the "correct" solution with MatchIt, keeping in mind that in the package vignette all the variables entering the propensity score calculations are numeric or integer, even those potentially coded as factors (such as education level, or marital status).
  3. Should factors always stay factors? What if a factor is coded, say, 0,1,2,3 (numeric values but class=factor): should it stay a factor?

Thank you so much for your help! EM

Upvotes: 2

Views: 1480

Answers (1)

Noah

Reputation: 4414

This is not a bug in MatchIt; it is something that can happen when analyzing any kind of data. Numeric variables carry hidden assumptions: that the values have a meaningful order and that the spacing between consecutive values is the same. When you enter a numeric variable in a model, you are also assuming a linear relationship between the variable and the outcome of that model (for a logistic propensity score model, the log odds of treatment). If these assumptions are invalid, there is a risk that your results will be as well.
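To make the difference concrete, here is a minimal sketch on simulated data. The variable names (`stage`, `treat`) and the treatment probabilities are made up for illustration and are not the asker's data. It fits the same logistic propensity score model with a four-level stage variable entered once as numeric and once as a factor, and compares the fitted scores with the observed treatment rate at each stage.

```r
# Simulated data; all names and numbers here are hypothetical
set.seed(1)
n <- 500
stage <- sample(1:4, n, replace = TRUE)        # ordinal code 1..4
# Treatment probability is deliberately non-linear in stage
treat <- rbinom(n, 1, c(0.2, 0.6, 0.3, 0.7)[stage])

# Numeric coding: a single slope, i.e. equal spacing and linearity on the log-odds scale
fit_num <- glm(treat ~ stage, family = binomial)
# Factor coding: one coefficient per non-reference level, no functional form imposed
fit_fac <- glm(treat ~ factor(stage), family = binomial)

length(coef(fit_num))  # intercept + one slope
length(coef(fit_fac))  # intercept + three level contrasts

# Mean fitted propensity score per stage under each coding, next to the observed rate
ps_by_stage <- cbind(
  numeric  = tapply(fitted(fit_num), stage, mean),
  factor   = tapply(fitted(fit_fac), stage, mean),
  observed = tapply(treat, stage, mean)
)
round(ps_by_stage, 3)
```

The factor model reproduces the observed treatment rate at every stage, while the numeric model has to force those rates onto a straight line in the log odds. Different propensity scores lead to different matches and weights, which is why the downstream survival curves can change.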

It's smart of you to assess the sensitivity of your results to these kinds of assumptions, and it's hard to know what the right answer is. The most conservative approach is to treat the variables as factors, which requires no assumption about the functional form of an otherwise numeric variable (though a flexibly modeled numeric predictor could be effective as well). The cost is a loss of precision in your estimates if the assumptions for numeric variables are in fact valid.
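If you switch a variable between the two classes while testing this, one detail is worth a quick sketch. The vector below is made up for illustration. Calling `as.numeric()` directly on a factor returns the internal level positions, not the labels, so a factor with numeric-looking labels should go through `as.character()` first.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("0", "2", "3", "3", "10"))

codes  <- as.numeric(f)                # level positions, NOT the labels
values <- as.numeric(as.character(f))  # the labels as numbers

codes
values

# Going the other way: treat a numeric code as unordered categories
f_again <- factor(values)
nlevels(f_again)
```

Whichever class you choose, it is safer to set it explicitly in the data before calling `matchit`, so that each propensity score specification you compare is the one you intended.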

Because propensity score matching really just relies on a good propensity score, and the role of the covariates in the model is mostly a nuisance, you should determine which propensity score model yields the best balance on your covariates. Assessing balance also requires assumptions about how the variables are distributed, but it's entirely feasible and advisable to assess balance on the covariates under a variety of transformations and forms. If one propensity score specification yields better balance across transformations of the covariates, then that is the propensity score model to trust. Going beyond standardized mean differences and looking at the full distribution of each covariate in both groups will help you make a more informed decision.
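A rough sketch of that balance check follows, under stated assumptions. The data are simulated, the helper names `smd`, `balance_forms` and `ipw` are hypothetical, and the weights are inverse probability weights used only as a dependency-free stand-in for matching weights. It computes standardized mean differences for several forms of the same covariate (linear, squared, and one indicator per level) under two propensity score specifications.

```r
# Weighted standardized mean difference for one numeric column.
# Standardizes by the unweighted pooled SD so that specifications are comparable.
smd <- function(x, treat, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[treat == 1], w[treat == 1])
  m0 <- weighted.mean(x[treat == 0], w[treat == 0])
  s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s
}

# Balance on several forms of one covariate: linear, squared, and each level
balance_forms <- function(x, treat, w) {
  forms <- cbind(linear = x, squared = x^2, model.matrix(~ factor(x) - 1))
  apply(forms, 2, smd, treat = treat, w = w)
}

# Simulated data; names and numbers are hypothetical
set.seed(2)
n <- 2000
stage <- sample(1:4, n, replace = TRUE)
treat <- rbinom(n, 1, c(0.2, 0.6, 0.3, 0.7)[stage])

# Stand-in weights: inverse probability weights from two specifications
ipw <- function(fit) ifelse(treat == 1, 1 / fitted(fit), 1 / (1 - fitted(fit)))
w_num <- ipw(glm(treat ~ stage, family = binomial))
w_fac <- ipw(glm(treat ~ factor(stage), family = binomial))

bal <- round(cbind(
  unweighted = balance_forms(stage, treat, rep(1, n)),
  numeric_ps = balance_forms(stage, treat, w_num),
  factor_ps  = balance_forms(stage, treat, w_fac)
), 3)
bal
```

With a `matchit` object, the matching weights (for example `match.it.1$weights`) would take the place of the stand-in weights. MatchIt's own `summary()` method and the cobalt package's `bal.tab()` report similar balance statistics without hand-written helpers.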

Upvotes: 2
