Reputation: 386
I want to simulate the predicted value from a linear regression model 1000 times and see how many of those times each type of car has the highest predicted mpg base on the independent variables of the model. I use the test and training set because I want to evaluate the fit of the model outside the training data.
data(mtcars)
library(caret)
trainingIndex <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
trainingset <- mtcars[trainingIndex,]
testingset <- mtcars[-trainingIndex,]
I create a data partition to have a training set and a test set. Now I have a test set and a training set I create the linear model
fit <- lm(mpg~., data = trainingset)
Now I have the linear model I tried to create a bootstrap to make the prediction from a simulation. I use boot_predict but it gives me an error.
library(finalfit)
boot_predict(fit,testingset, type = "response", R = 1000, estimate_name = NULL,
confint_sep = "to", condense = TRUE, boot_compare = TRUE, compare_name = NULL,
comparison = "difference", ref_symbol = "-", digits = c(2,3))
Error: Problem with mutate()
input term
.
x invalid format '%.2f'; use format %s for character objects
i Input term
is (function (x, digits) ...
.
Run rlang::last_error()
to see where the error occurred.
In addition: Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading
I am not sure if this is the best way to obtaining the 1000 prediction from the bootstrap
Upvotes: 2
Views: 497
Reputation: 16930
The issue here is that boot_predict()
calls broom::tidy()
, which adds a term
column if newdata
has row names that are not equal to seq_len(nrow(newdata))
, which breaks a formatting step in boot_predict()
(you'll find that debug()
is your friend). You may want to consider filing an issue with the developer of finalfit
here referencing this Stack Overflow question about this.
In the mean time, you can fix this by changing the row names of your test set:
data(mtcars)
library(caret)
set.seed(42)
trainingIndex <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
trainingset <- mtcars[trainingIndex,]
testingset <- mtcars[-trainingIndex,]
rownames(testingset) <- 1:nrow(testingset) ## This is the new step that fixes it
fit <- lm(mpg~., data = trainingset)
library(finalfit)
boot_predict(fit, testingset)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
4 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
estimate Difference
1 21.02 (13.07 to 27.04) -
2 21.68 (6.25 to 40.14) 1.42 (-7.30 to 13.93, p=0.720)
3 15.39 (12.09 to 19.82) -4.82 (-10.83 to 2.68, p=0.180)
4 13.27 (3.26 to 24.34) -6.34 (-18.20 to 4.02, p=0.200)
Upvotes: 1
Reputation: 46938
The part about how to use the training and testing remains unclear and I suggest you can sort that out and put it as another question. It seems that there is more than one question packed in this.
I can try and address this:
see how many of those times each type of car has the highest predicted mpg base on the independent variables of the model.
For 1 bootstrap, the basic code to fit goes like:
set.seed(111)
da = mtcars[sample(nrow(mtcars),replace=TRUE),]
fit = lm(mpg ~ .,data=da)
To get the ranking we can do:
rank(predict(fit,mtcars))
And we wrap this into a function and iterate this through many bootstraps:
bootpred = function(data){
da = da[sample(nrow(da),replace=TRUE),]
fit = lm(mpg ~ .,data=da)
rank(predict(fit,data))
}
predictions = replicate(1000,bootpred(mtcars))
The results is a matrix, every column 1 bootstrap, every row, the rank of the prediction for the car:
head(predictions[,1:5],10)
[,1] [,2] [,3] [,4] [,5]
Mazda RX4 18 16 12 20 18
Mazda RX4 Wag 14 12 11 16 17
Datsun 710 24 24 27 26 23
Hornet 4 Drive 22 19 20 23 21
Hornet Sportabout 15 13 15 11 15
Valiant 16 11 18 21 20
Duster 360 7 29 5 6 9
Merc 240D 23 23 23 25 25
Merc 230 25 20 19 24 32
Merc 280 20 18 22 17 16
This will tell you how many times each car has the highest value:
rowSums(predictions==1)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
18 0 0 0
Hornet Sportabout Valiant Duster 360 Merc 240D
0 0 3 1
Merc 230 Merc 280 Merc 280C Merc 450SE
12 1 0 1
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
0 0 80 72
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
174 0 3 0
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
18 3 0 10
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
3 0 0 0
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
0 10 591 0
Upvotes: 2