Reputation: 333
I have 4000 (continuous) predictor variables in a set of 150 patients. First, variables that are associated with survival should be identified. I therefore use the multiple testing procedures function (http://svitsrv25.epfl.ch/R-doc/library/multtest/html/MTP.html) with the t-statistic for tests of regression coefficients in Cox proportional hazards survival models to identify significant predictors. This analysis identifies 60 parameters that are significantly associated with survival. I then perform unsupervised k-means clustering with the ConsensusClusterPlus package (https://www.bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html), which identifies 3 clusters as the optimal solution based on the CDF curve and progression graph. If I then perform a Kaplan-Meier survival analysis, I see that each of the three clusters is associated with a distinct survival pattern (short / intermediate / long survival).
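For reference, a minimal R sketch of this screening / clustering / Kaplan-Meier pipeline might look as follows. The object names `expr`, `time` and `status` are placeholders, and the MTP and ConsensusClusterPlus arguments are only an assumed configuration, not necessarily the exact one I used:

```r
## Hedged sketch: `expr` is a hypothetical 4000 x 150 matrix of predictors
## (rows = variables, columns = patients); `time`/`status` are the survival data.
library(multtest)             # MTP()
library(ConsensusClusterPlus)
library(survival)

surv.obj <- Surv(time, status)

## Multiple-testing procedure with the Cox regression coefficient test
mtp.res  <- MTP(X = expr, Y = surv.obj, test = "coxph.YvsXZ", B = 1000)
sig.vars <- which(mtp.res@adjp < 0.05)        # e.g. the 60 significant predictors

## Consensus k-means clustering on the significant variables
cc <- ConsensusClusterPlus(expr[sig.vars, ], maxK = 6, reps = 1000,
                           pItem = 0.8, clusterAlg = "km",
                           distance = "euclidean", seed = 1)
cluster <- cc[[3]]$consensusClass             # the 3-cluster solution

## Kaplan-Meier curves per cluster
fit.km <- survfit(surv.obj ~ factor(cluster))
plot(fit.km, col = 1:3)
```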
The question that I now have is the following: let's assume that I have another set of 50 patients for whom I'd like to predict to which of the three clusters each patient most likely belongs. How can I achieve this? Do I need to train a classifier (e.g. with the caret package (topepo.github.io/caret/bytag.html), where the 150 patients with the 60 significant parameters form the training set and the algorithm knows which patient was allocated to which of the three clusters) and validate the classifier in the 50 new patients? And then perform a Kaplan-Meier survival analysis to see whether the predicted clusters in the validation set (n=50) are again associated with distinct survival patterns?
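Something along these lines is what I have in mind; a hedged sketch with caret, assuming the objects from the block above plus hypothetical `expr.new` (a 60 x 50 matrix for the new patients, same variables in the same order as the training data) and `time.new`/`status.new` for their survival outcomes:

```r
## Hedged sketch of the classify-then-validate idea; all object names are placeholders.
library(caret)
library(survival)

train.df <- data.frame(t(expr[sig.vars, ]), cluster = factor(cluster))

set.seed(1)
ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit.rf <- train(cluster ~ ., data = train.df, method = "rf", trControl = ctrl)

## Predict cluster membership for the 50 new patients
## (the columns of new.df must carry the same variable names, in the same
##  order, as the training data)
new.df       <- data.frame(t(expr.new))
pred.cluster <- predict(fit.rf, newdata = new.df)

## Check whether the predicted clusters separate survival in the validation set
fit.val <- survfit(Surv(time.new, status.new) ~ pred.cluster)
plot(fit.val, col = 1:3)
survdiff(Surv(time.new, status.new) ~ pred.cluster)   # log-rank test
```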
Thanks for your help.
Upvotes: 1
Views: 1949
Reputation: 908
My advice is to build a predictive model, such as a random forest, using the cluster number as the outcome. It will give better results than assigning new patients by their distance to the cluster centroids.
There are several reasons, but consider that a predictive model is specialized for exactly this task: for example, it will keep and weight the reliable variables, whereas in a distance-based assignment every variable counts the same.
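A minimal sketch of that idea, using the randomForest package directly with the cluster label as the outcome (the objects `expr`, `sig.vars`, `cluster` and `expr.new` are the hypothetical names used in the question, and the new data must contain the same variables in the same order):

```r
## Hedged sketch: random forest trained on the 60 selected variables,
## with the k-means cluster label as the class to predict.
library(randomForest)

rf <- randomForest(x = t(expr[sig.vars, ]), y = factor(cluster),
                   ntree = 1000, importance = TRUE)

## Variables that actually drive the cluster separation get higher importance,
## whereas a plain distance rule weights all 60 variables equally.
varImpPlot(rf)

pred.cluster <- predict(rf, newdata = t(expr.new))
```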
Upvotes: 0
Reputation: 66805
The answer is much simpler. You already have your k-means solution with 3 clusters. Each cluster is identified by its centroid (a point in your 60-dimensional space). In order to "classify" a new point you just measure the Euclidean distance to each of these three centroids and select the cluster whose centroid is closest. That's all. This follows directly from the fact that k-means gives you a partitioning of the whole space, not just of your training set.
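A minimal sketch of that rule, assuming `km` holds a k-means fit on the 150 x 60 training matrix and `new.x` is the 50 x 60 matrix of new patients (both names are hypothetical; with ConsensusClusterPlus you could instead recompute the centroids as the per-cluster column means of the training data):

```r
## Hedged sketch: assign each new patient to the cluster whose centroid is
## nearest in Euclidean distance.
assign.to.centroid <- function(new.x, centers) {
  d <- apply(centers, 1, function(ctr) {
    sqrt(rowSums(sweep(new.x, 2, ctr)^2))   # distance of every new patient to this centroid
  })
  max.col(-d)                               # index of the nearest centroid per patient
}

pred.cluster <- assign.to.centroid(new.x, km$centers)
```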
Upvotes: 1