I'm new to statistics and R.
I'm currently practicing using a GBM model to predict the "charges" value from an insurance company, with the variables age, bmi, number of children, and smoker. I managed to fit the gbm model, but I don't know how to compare the predicted values with the actual values here.
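library(dplyr)  # as_tibble(), mutate(), %>%
library(caret)  # createDataPartition(), trainControl(), train()
insure <- as_tibble(insurance)  # as.tibble() is deprecated in favour of as_tibble()
insure <- insure %>%
  mutate(Agegroup = as.factor(findInterval(age, c(18, 35, 50, 80))))
levels(insure$Agegroup) <- c("Youth", "Mid Aged", "Old")
# Divide the dataset into a training and a validation set for some machine learning predictions
trainds <- createDataPartition(insure$Agegroup, p = 0.8, list = FALSE)
validate <- insure[-trainds, ]
trainds <- insure[trainds, ]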
# Set metric and control
control <- trainControl(method = "cv", number = 10)
metric <- "RMSE"
# Set up models
set.seed(233)
fit.gbm <- train(charges ~ ., data = trainds, method = "gbm",
                 trControl = control, metric = metric, verbose = FALSE)
summary(fit.gbm)
I don't know which data I should use for the comparison. Since the model was trained on the "trainds" data, should I compare the predictions with the "validate" data, or with the full "insure" data?
This is my attempt:
plot(predict(fit.gbm),   # should I pass newdata here?
     validate$charges,   # not sure whether to use validate$charges or another dataset
     xlab = "Predicted Values",
     ylab = "Observed Values")
abline(a = 0,
       b = 1,
       col = "red",
       lwd = 2)
However, since the two vectors have different lengths, I keep getting this error:
'x' and 'y' lengths differ
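To make the length mismatch concrete, here is a minimal sketch on simulated data. The data frame `toy`, its columns, and the `lm()` fit are stand-ins assumed for illustration only; they are not the insurance data or the caret gbm model from above. It shows that `predict()` without `newdata` returns one value per training row, while `predict()` with `newdata` set to the held-out rows returns one value per held-out row, which then lines up with the observed values of that same set.

```r
# Simulated stand-in for the insurance data (all values are made up)
set.seed(233)
n <- 200
toy <- data.frame(age = sample(18:64, n, replace = TRUE),
                  bmi = rnorm(n, mean = 30, sd = 6))
toy$charges <- 250 * toy$age + 300 * toy$bmi + rnorm(n, sd = 2000)

# 80/20 split, analogous to trainds / validate
idx <- sample(seq_len(n), size = 0.8 * n)
toy_train <- toy[idx, ]
toy_valid <- toy[-idx, ]

# lm() is only a stand-in model here, not the gbm fit from the question
fit <- lm(charges ~ age + bmi, data = toy_train)

pred_train <- predict(fit)                       # one value per training row
pred_valid <- predict(fit, newdata = toy_valid)  # one value per held-out row

# Lengths now match the data each prediction vector belongs to
length(pred_train) == nrow(toy_train)
length(pred_valid) == nrow(toy_valid)

# Held-out RMSE, the same metric used for tuning above
rmse_valid <- sqrt(mean((toy_valid$charges - pred_valid)^2))

# Predicted vs. observed on the held-out rows only
plot(pred_valid, toy_valid$charges,
     xlab = "Predicted Values",
     ylab = "Observed Values")
abline(a = 0, b = 1, col = "red", lwd = 2)
```

If the same idea carries over to the caret fit, the analogous call would presumably be `predict(fit.gbm, newdata = validate)` plotted against `validate$charges`, so that both vectors come from the validation rows.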