Reputation: 441
I'm working with Random Forests/Logistic Regression models for my predictions. Part of my study is to create a 'new' data frame to resemble a new patient, and predict the likelihood of them experiencing death within 30 days of a procedure. After performing double-cross validation to get an accuracy rating I'm currently fitting my data on the full dataset:
#Logistic Regression Model:
fullModelMort = glm(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,family="binomial")
#Random Forest Model:
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05))
I then am creating 'New Patients' to feed into my models for prediction and predicting the probability of mortality based on these inputs:
#New Patients for Predictions
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
#Predict using LR Model
Patient1 = predict(fullModelMort, newPatient1, type="response")
Patient2 = predict(fullModelMort, newPatient2, type="response")
#Classify whether a patient is High/Low Risk based on probability for mortality:
Determine_Mortality = function(prediction){
if(prediction > .05){
Response=paste("High Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
else{
Response=paste("Low Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
}
print(paste0("Patient 1 Results - ", Determine_Mortality(Patient1)))
print(paste0("Patient 2 Results - ", Determine_Mortality(Patient2)))
This section works fine for the logistic regression model, however, when I try to do the same thing for my Random Forest Models, I'm getting the following error:
Error in predict.randomForest(surgery.bag, newPatient1, type = "response") : Type of predictors in new data do not match that of the training data.
This is my code for the Random Forest Predictions:
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
test1=predict(surgery.bag,newPatient1,type="response")
Summary of the dataset used to fit the models (Although only a subset of these columns are used in the fitting)
ahrq_ccs age gender race asa_status bmi baseline_cancer baseline_cvd baseline_dementia
Arthroplasty knee : 3032 Min. : 1.00 F:15279 African American: 3416 I-II :15244 Min. : 2.15 No :18593 No :13947 No :28087
Nephrectomy; partial or complete : 2559 1st Qu.:48.30 M:13008 Caucasian :23768 III :12142 1st Qu.:24.61 Yes: 9694 Yes:14340 Yes: 200
Spinal fusion : 2377 Median :58.60 Other : 1103 IV-VI: 901 Median :28.20
Open prostatectomy : 2356 Mean :57.71 Mean :29.47
Colorectal resection : 2269 3rd Qu.:68.30 3rd Qu.:32.84
Hysterectomy; abdominal and vaginal: 2253 Max. :90.00 Max. :92.59
(Other) :13441
baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary baseline_charlson mortality_rsi complication_rsi ccsMort30Rate ccsComplicationRate
No :24582 No :22021 No :23195 No :25639 No :25202 Min. : 0.000 Min. :-4.4000 Min. :-4.7200 Min. :0.000000 Min. :0.01612
Yes: 3705 Yes: 6266 Yes: 5092 Yes: 2648 Yes: 3085 1st Qu.: 0.000 1st Qu.:-1.2400 1st Qu.:-0.8600 1st Qu.:0.000789 1st Qu.:0.08198
Median : 0.000 Median :-0.3000 Median :-0.3100 Median :0.002764 Median :0.10937
Mean : 1.178 Mean :-0.5385 Mean :-0.4258 Mean :0.004328 Mean :0.13322
3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.007398 3rd Qu.:0.18337
Max. :13.000 Max. : 4.8300 Max. :12.5600 Max. :0.016673 Max. :0.46613
hour dow month moonphase mort30 complication
Min. : 6.000 Fri:5351 Jun : 2845 First Quarter:7126 0:28170 0:24542
1st Qu.: 7.000 Mon:6223 Aug : 2734 Full Moon :7175 1: 117 1: 3745
Median : 9.000 Thu:4936 Mar : 2587 Last Quarter :7159
Mean : 9.854 Tue:6258 Apr : 2547 New Moon :6827
3rd Qu.:12.000 Wed:5519 Jan : 2534
Max. :18.000 May : 2524
(Other):12516
I'm curious if it's because the levels of my test data don't match the levels of the dataset used for fitting, but I can't be certain.
The dataset can be downloaded here Reproducible Code:
library(MASS)
library(ggplot2)
library(Hmisc)
library(corrplot)
library(dplyr)
library(randomForest)
library(tidyr)
#Read in the data set
surgery=read.csv("SurgeryTiming.csv")
#Remove dummy values
surgery$gender[surgery$gender == ""] <- NA
surgery$asa_status[surgery$asa_status == ""] <- NA
surgery$race[surgery$race == ""] <- NA
surgery$bmi[surgery$bmi == ""] <- NA
surgery$hour = as.numeric(sub("\\..*", "", as.character(surgery$hour))) #Split out the base hour of surgery
#Drop NA values
surgery = surgery %>% drop_na(gender)
surgery = surgery %>% drop_na(asa_status)
surgery = surgery %>% drop_na(race)
surgery = surgery %>% drop_na(age)
surgery = surgery %>% drop_na(bmi)
#Drop additional levels that now have no values
surgery$gender = droplevels(surgery$gender)
surgery$asa_status = droplevels(surgery$asa_status)
surgery$race = droplevels(surgery$race)
#View our numeric data distributions
num_data <- surgery[,sapply(surgery,is.numeric)]
hist.data.frame(num_data)
surgery$complication=revalue(surgery$complication,c("Yes"=1))
surgery$complication=revalue(surgery$complication,c("No"=0))
surgery$mort30=revalue(surgery$mort30,c("Yes"=1))
surgery$mort30=revalue(surgery$mort30,c("No"=0))
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05)) #The cutoff is the probability for each group selection, probs of 1% or higher are classified as 'Mortality' occuring
test1=predict(surgery.bag,newPatient1,type="response")
Any suggestions/advice is extremely appreciated.
Upvotes: 0
Views: 974
Reputation: 278
I can't offer a solution but the error that you're getting has led me to try:
test1=predict(surgery.bag,newPatient1,type="response")
test2 <- predict(surgery.bag,surgery[7,],type = "response")
The last line will run, making a prediction on row #7 of the original dataframe. This tells me that there is an issue in the way that you are creating newPatient1 and newPatient2. The data in those rows has the correct type (factor, num, etc...) but when you create (for example) newPatient1$baseline_cancer it is a factor with only 1 level, whereas surgery$baseline_cancer is a factor with 2 levels. You can see the differences by calling:
str(surgery[,names(newPatient1)])
str(newPatient1)
I am not sure how to create newPatient1 with the correct level of factors for each variable.
Upvotes: 0