Survival Model Validation using external data

Question

I have two data sets (training and validation) for building and validating a Cox model.

With the training data set, I fitted a Cox model using stepwise selection.

Only the variables that were significant in that model were included in the validation model. Is this the right approach?

While validating the model, I realized that the variables are not significant in the validation model and that the assumptions of the Cox model do not hold (I checked the assumptions on the validation data). Should I ignore the fact that the variables are insignificant and go ahead and correct for the violated model assumptions in the validation data?

Thirdly, in both the training and validation data I have a variable 'treatment' with three groups. In the training data the groups are Standard, New drug and Mixture, while in the validation data the groups are Standard, New drug and X (a treatment that is different from Mixture in the training data). Is it right to include this variable in both models, should I eliminate the groups that do not match (Mixture from the training data and X from the validation data), or should I work with the data as they are? I am not sure how this affects my analysis.

Thanks for your responses.

StatMan · Accepted Answer

To answer your first question: Yes, this is the right approach. The whole idea of a training and validation set is that you make all the decisions about the model (here: which variables to add) based on the training set. The validation set is then used to assess how robust your results in the training set are. This way you can check for overfitting, outliers, data errors etc.
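
One way to make this check concrete, sketched here under my own assumptions rather than taken from the question, is to keep the training coefficients fixed, compute a risk score for each validation subject, and measure discrimination with Harrell's concordance index. The coefficient values, the variable names (`age`, `new_drug`) and the records below are all invented for illustration.

```python
# Hypothetical coefficients (log hazard ratios) from the model fitted on the
# TRAINING data. They are frozen and never re-estimated on the validation data.
TRAIN_COEFS = {"age": 0.03, "new_drug": -0.60}


def linear_predictor(row, coefs):
    """Risk score x'beta for one subject, using the frozen coefficients."""
    return sum(beta * row[name] for name, beta in coefs.items())


def harrell_c(times, events, scores):
    """Harrell's concordance index; a higher score should mean shorter survival."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")


# Made-up validation records: follow-up time, event indicator (1 = event), covariates.
validation = [
    {"time": 5, "event": 1, "age": 70, "new_drug": 0},
    {"time": 8, "event": 1, "age": 65, "new_drug": 0},
    {"time": 12, "event": 0, "age": 60, "new_drug": 1},
    {"time": 15, "event": 1, "age": 75, "new_drug": 0},
    {"time": 20, "event": 0, "age": 50, "new_drug": 1},
]

scores = [linear_predictor(r, TRAIN_COEFS) for r in validation]
c_index = harrell_c(
    [r["time"] for r in validation],
    [r["event"] for r in validation],
    scores,
)
print(f"C-index on validation data: {c_index:.2f}")
```

A C-index near 0.5 means the frozen model ranks validation subjects no better than chance, while values closer to 1 indicate better discrimination. This kind of check does not depend on p-values from a model refitted on the validation data.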

However, I would not recommend stepwise regression methods. See the top answer of this post: https://stats.stackexchange.com/questions/115843/backward-selection-for-cox-model-using-r.

Second question: No, you should not ignore the fact that the variables are insignificant. This is exactly why you have a validation set. Perhaps your training set contains a few very influential observations (outliers), or the stepwise selection overfitted the training data. Either way, you need to investigate further.

Which assumption do you mean? I assume you mean the proportional hazards (PH) assumption, since it is the one most often violated. The same reasoning applies as for your first question: check the assumption on the training set first. If it does not hold there either, adjust your model. If it is indeed the PH assumption that is violated for a variable, add an interaction between that variable and time, or fit a stratified Cox model. [See, for example: http://www.dbc.wroc.pl/Content/27006/Borucka_Extensions_of_Cox_model_For_non_proportional.pdf]
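
For reference, the two remedies mentioned above are usually written as follows. The notation (baseline hazard $h_0$, coefficients $\beta$ and $\gamma$, a chosen function of time $g(t)$, stratum index $s$) is standard textbook notation that I am assuming here; none of it comes from the question.

```latex
% Standard Cox model: the effect of x is constant over time (PH assumption)
h(t \mid x) = h_0(t)\,\exp(\beta x)

% Time interaction: the log hazard ratio of x varies with g(t), e.g. g(t) = \log t
h(t \mid x) = h_0(t)\,\exp\bigl(\beta x + \gamma\, x\, g(t)\bigr)

% Stratified Cox model: a separate baseline hazard for each stratum s of the
% variable that violates PH; the remaining covariates z share one coefficient vector
h_s(t \mid z) = h_{0s}(t)\,\exp(\beta^{\top} z)
```

In the time-interaction form, $\gamma = 0$ recovers the ordinary Cox model, so the size of $\gamma$ indicates how strongly the effect of $x$ changes over time. In the stratified form, no hazard ratio is estimated for the stratifying variable itself.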

I am not entirely sure about my answer to the third question, but here it is: it is not right to include X in your validation model if it is not included in your training model. The variable treatment is a factor, so in a regression it is expanded into dummy (0/1) variables, one for each level other than the reference level. Including X is therefore the same as introducing a whole new variable in your validation model, which runs counter to the idea of validation.
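
To illustrate the dummy-variable point, here is a minimal sketch that uses the group names from the question with made-up rows. It assumes Standard is the reference level and shows one possible way to handle the mismatch (keeping only subjects whose treatment occurs in both data sets); the helper names are hypothetical.

```python
# Made-up treatment labels, using the group names from the question.
train_treatment = ["Standard", "New drug", "Mixture", "Standard", "New drug"]
valid_treatment = ["Standard", "X", "New drug", "X", "Standard"]

REFERENCE = "Standard"  # assumed reference level


def dummy_columns(levels, reference):
    """Non-reference levels, each of which becomes one 0/1 column."""
    return sorted(set(levels) - {reference})


def encode(value, columns):
    """One row of 0/1 indicators for a single treatment value."""
    return [1 if value == col else 0 for col in columns]


train_cols = dummy_columns(train_treatment, REFERENCE)
valid_cols = dummy_columns(valid_treatment, REFERENCE)

# Levels present in only one data set have no counterpart in the other model.
unmatched = sorted(set(train_cols) ^ set(valid_cols))
shared_cols = sorted(set(train_cols) & set(valid_cols))

# One option: keep only subjects whose treatment exists in both data sets.
shared_levels = set(shared_cols) | {REFERENCE}
valid_kept = [t for t in valid_treatment if t in shared_levels]
valid_design = [encode(t, shared_cols) for t in valid_kept]

print("training dummies:    ", train_cols)
print("validation dummies:  ", valid_cols)
print("unmatched levels:    ", unmatched)
print("validation rows kept:", valid_kept)
```

The training model has a column for Mixture but none for X, so a coefficient for X would have to be estimated from scratch on the validation data, which is the problem described above.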

Hope this helps!
