Reputation: 25
I am trying to predict whether water is safe to drink or not. The data set is the one here: https://www.kaggle.com/adityakadiwal/water-potability?select=water_potability.csv. Assume the data frame consists of Ph, Hardness, Solids, Chloramines and Potability.
I'd like to run logistic regression with 10-fold cross-validation (10 is just an example, I'd like to try other numbers of folds too). Disregarding the computational power needed, I'd then like to repeat this 5 more times with different random 10-fold splits and choose the best model.
I have come across the k-fold function and the glm function, but I don't know how to combine them to repeat this process over 5 random splits. Later on, I'd also like to do something similar with KNN. I'd appreciate any help on this matter.
Some code:
library(readr)
library(caret)
df <- read_csv("water_potability.csv")
train_model <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 5)
model <- train(Potability ~ ., data = df, method = "regLogistic",
               trControl = train_model)
However, I'd prefer to use non-regularized logistic regression.
Upvotes: 1
Views: 1039
Reputation: 50668
You can do the following, using method = "glm" with family = "binomial" for non-regularized logistic regression (based on some sample data from here):
library(caret)
# Sample data since your post doesn't include sample data
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Make sure the response `admit` is a `factor`
df$admit <- factor(df$admit)
# Set up 10-fold CV, repeated 5 times with different random splits
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Train the model
model <- train(
  admit ~ .,
  data = df,
  method = "glm",
  family = "binomial",
  trControl = train_model)
model
#Generalized Linear Model
#
#400 samples
# 3 predictor
# 2 classes: '0', '1'
#
#No pre-processing
#Resampling: Cross-Validated (10 fold, repeated 5 times)
#Summary of sample sizes: 359, 361, 360, 360, 359, 361, ...
#Resampling results:
#
# Accuracy Kappa
# 0.7020447 0.1772786
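Since the question also asks about trying other numbers of folds, here is a minimal sketch of how that might look. The fold counts in `fold_counts` and the seed are arbitrary choices for illustration, not something from the question. Note that plain `glm` has no tuning parameters, so repeated CV here does not select among candidate models; it only gives a more stable estimate of out-of-sample performance, and the final model is refit on the full data.

```r
library(caret)

# Same sample data as above
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
df$admit <- factor(df$admit)

# Arbitrary seed, only so that the random fold assignment is reproducible
set.seed(2021)

# Hypothetical fold counts to compare
fold_counts <- c(5, 10)

# Fit the same non-regularized logistic regression once per fold count,
# each time with 5 repeats of k-fold CV
fits <- lapply(fold_counts, function(k) {
  ctrl <- trainControl(method = "repeatedcv", number = k, repeats = 5)
  train(admit ~ ., data = df, method = "glm", family = "binomial",
        trControl = ctrl)
})
names(fits) <- paste0("k", fold_counts)

# Per-resample metrics: one row per fold per repeat
head(fits$k10$resample)

# Mean cross-validated accuracy for each fold count
sapply(fits, function(m) m$results$Accuracy)
```

The `resample` element is useful if you want to look at the spread of accuracy across the individual folds and repeats rather than only the average.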
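We can look at the confusion matrix for good measure. Note that predict(model) returns predictions on the training data, so this is an in-sample confusion matrix: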
confusionMatrix(predict(model), df$admit)
#Confusion Matrix and Statistics
#
# Reference
#Prediction 0 1
# 0 253 98
# 1 20 29
#
# Accuracy : 0.705
# 95% CI : (0.6577, 0.7493)
# No Information Rate : 0.6825
# P-Value [Acc > NIR] : 0.1809
#
# Kappa : 0.1856
#
#Mcnemar's Test P-Value : 1.356e-12
#
# Sensitivity : 0.9267
# Specificity : 0.2283
# Pos Pred Value : 0.7208
# Neg Pred Value : 0.5918
# Prevalence : 0.6825
# Detection Rate : 0.6325
# Detection Prevalence : 0.8775
# Balanced Accuracy : 0.5775
#
# 'Positive' Class : 0
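For the KNN part of the question, the same `trainControl` object can be reused with `method = "knn"`. The following is a rough sketch rather than a tuned setup: the grid of `k` values and the seed are assumptions chosen only for illustration, and centering and scaling is added because KNN is distance-based. Unlike `glm`, KNN has a tuning parameter, so here repeated CV really does pick the best candidate.

```r
library(caret)

# Same sample data as above
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
df$admit <- factor(df$admit)

# Arbitrary seed, only so that the random fold assignment is reproducible
set.seed(2021)

# 10-fold CV, repeated 5 times
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# KNN with a hypothetical grid of neighbourhood sizes;
# predictors are centered and scaled before distances are computed
knn_model <- train(
  admit ~ .,
  data = df,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneGrid = data.frame(k = seq(5, 25, by = 4)),
  trControl = train_model)

# Cross-validated accuracy for every k, and the k that was selected
knn_model$results
knn_model$bestTune
```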
Upvotes: 1