iftach s

Reputation: 25

Use logistic regression on a data set with repeated k-fold cross-validation in R

I am trying to predict whether water is safe to drink. The data set is the one here: https://www.kaggle.com/adityakadiwal/water-potability?select=water_potability.csv. Assume the dataframe contains the columns Ph, Hardness, Solids, Chloramines and Potability.

I'd like to run logistic regression with 10-fold cross-validation (as an example; I want to try other numbers of folds as well). Disregarding the computational power needed, I'd then like to repeat this with 5 different randomized 10-fold splits and choose the best model.

I have come across a k-fold function and the glm function, but I don't know how to combine them to repeat this process 5 randomized times. Later on, I'd also like to do something similar with KNN. I'd appreciate any help on this matter.

Some code:

library(readr)
library(caret)

df <- read_csv("water_potability.csv")

train_model <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 5)

model <- train(Potability ~ ., data = df, method = "regLogistic",
               trControl = train_model)

However, I'd prefer to use non-regularized logistic regression.

Upvotes: 1

Views: 1039

Answers (1)

Maurits Evers

Reputation: 50668

You can do the following (based on some sample data, read in below, since your post doesn't include any).

library(caret)

# Sample data since your post doesn't include sample data
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Make sure the response `admit` is a `factor`
df$admit <- factor(df$admit)

# Set up 10-fold CV repeated 5 times
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Train the model
model <- train(
    admit ~ ., 
    data = df, 
    method = "glm",
    family = "binomial",
    trControl = train_model)
model
#Generalized Linear Model 
#
#400 samples
#  3 predictor
#  2 classes: '0', '1' 
#
#No pre-processing
#Resampling: Cross-Validated (10 fold, repeated 5 times) 
#Summary of sample sizes: 359, 361, 360, 360, 359, 361, ... 
#Resampling results:
#
#  Accuracy   Kappa    
#  0.7020447  0.1772786
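
If you want the coefficient estimates of the underlying non-regularized logistic regression, one option is to inspect the final glm fit that caret stores in the train object:

# The fitted (non-regularized) glm is stored in model$finalModel
summary(model$finalModel)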

We can look at the confusion matrix for good measure:

confusionMatrix(predict(model), df$admit)
#Confusion Matrix and Statistics
#
#          Reference
#Prediction   0   1
#         0 253  98
#         1  20  29
#
#              Accuracy : 0.705           
#                95% CI : (0.6577, 0.7493)
#   No Information Rate : 0.6825          
#   P-Value [Acc > NIR] : 0.1809          
#
#                 Kappa : 0.1856          
#
#Mcnemar's Test P-Value : 1.356e-12       
#                                          
#            Sensitivity : 0.9267          
#            Specificity : 0.2283          
#         Pos Pred Value : 0.7208          
#         Neg Pred Value : 0.5918          
#             Prevalence : 0.6825          
#         Detection Rate : 0.6325          
#   Detection Prevalence : 0.8775          
#      Balanced Accuracy : 0.5775          
#                                          
#       'Positive' Class : 0     
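
Since you also mention wanting to do something similar with KNN and then pick the best model, here is a minimal sketch (reusing the df, train_model and model objects from above) of how the same repeated-CV setup can be passed to caret's "knn" method and the two cross-validated fits compared with resamples(); the seed, tuneLength and preprocessing choices here are only illustrative assumptions.

# Reuse the same repeated-CV specification for KNN (illustrative settings)
set.seed(2022)
model_knn <- train(
    admit ~ .,
    data = df,
    method = "knn",
    preProcess = c("center", "scale"),  # KNN is distance-based, so scale the predictors
    tuneLength = 10,                    # try 10 candidate values of k
    trControl = train_model)

# Compare the cross-validated logistic regression and KNN fits
comparison <- resamples(list(logistic = model, knn = model_knn))
summary(comparison)

summary(comparison) reports accuracy and kappa for both models across the 50 resamples (10 folds x 5 repeats), which gives you a basis for choosing between them.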

Upvotes: 1
