Pastor Soto
Pastor Soto

Reputation: 386

How can I get 1000 predictions from a linear model in R?

I want to simulate the predicted value from a linear regression model 1000 times and see how many of those times each type of car has the highest predicted mpg base on the independent variables of the model. I use the test and training set because I want to evaluate the fit of the model outside the training data.

data(mtcars)
library(caret)
trainingIndex <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
trainingset <- mtcars[trainingIndex,]
testingset <- mtcars[-trainingIndex,]

I create a data partition to have a training set and a test set. Now I have a test set and a training set I create the linear model

fit <- lm(mpg~., data = trainingset)

Now I have the linear model I tried to create a bootstrap to make the prediction from a simulation. I use boot_predict but it gives me an error.

library(finalfit)
boot_predict(fit,testingset, type = "response", R = 1000, estimate_name = NULL, 
             confint_sep = "to", condense = TRUE, boot_compare = TRUE, compare_name = NULL,
             comparison = "difference", ref_symbol = "-", digits = c(2,3))

Error: Problem with mutate() input term. x invalid format '%.2f'; use format %s for character objects i Input term is (function (x, digits) .... Run rlang::last_error() to see where the error occurred. In addition: Warning message: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == : prediction from a rank-deficient fit may be misleading

I am not sure if this is the best way to obtaining the 1000 prediction from the bootstrap

Upvotes: 2

Views: 497

Answers (2)

duckmayr
duckmayr

Reputation: 16930

The issue here is that boot_predict() calls broom::tidy(), which adds a term column if newdata has row names that are not equal to seq_len(nrow(newdata)), which breaks a formatting step in boot_predict() (you'll find that debug() is your friend). You may want to consider filing an issue with the developer of finalfit here referencing this Stack Overflow question about this.

In the mean time, you can fix this by changing the row names of your test set:

data(mtcars)
library(caret)
set.seed(42)
trainingIndex <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
trainingset <- mtcars[trainingIndex,]
testingset <- mtcars[-trainingIndex,]
rownames(testingset) <- 1:nrow(testingset) ## This is the new step that fixes it
fit <- lm(mpg~., data = trainingset)
library(finalfit)
boot_predict(fit, testingset)
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
2 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
3 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
                estimate                      Difference
1 21.02 (13.07 to 27.04)                               -
2  21.68 (6.25 to 40.14)  1.42 (-7.30 to 13.93, p=0.720)
3 15.39 (12.09 to 19.82) -4.82 (-10.83 to 2.68, p=0.180)
4  13.27 (3.26 to 24.34) -6.34 (-18.20 to 4.02, p=0.200)

Upvotes: 1

StupidWolf
StupidWolf

Reputation: 46938

The part about how to use the training and testing remains unclear and I suggest you can sort that out and put it as another question. It seems that there is more than one question packed in this.

I can try and address this:

see how many of those times each type of car has the highest predicted mpg base on the independent variables of the model.

For 1 bootstrap, the basic code to fit goes like:

set.seed(111)
da = mtcars[sample(nrow(mtcars),replace=TRUE),]
fit = lm(mpg ~ .,data=da)

To get the ranking we can do:

rank(predict(fit,mtcars))

And we wrap this into a function and iterate this through many bootstraps:

bootpred = function(data){

da = da[sample(nrow(da),replace=TRUE),]
fit = lm(mpg ~ .,data=da)
rank(predict(fit,data))

}

predictions = replicate(1000,bootpred(mtcars))

The results is a matrix, every column 1 bootstrap, every row, the rank of the prediction for the car:

head(predictions[,1:5],10)
                  [,1] [,2] [,3] [,4] [,5]
Mazda RX4           18   16   12   20   18
Mazda RX4 Wag       14   12   11   16   17
Datsun 710          24   24   27   26   23
Hornet 4 Drive      22   19   20   23   21
Hornet Sportabout   15   13   15   11   15
Valiant             16   11   18   21   20
Duster 360           7   29    5    6    9
Merc 240D           23   23   23   25   25
Merc 230            25   20   19   24   32
Merc 280            20   18   22   17   16

This will tell you how many times each car has the highest value:

rowSums(predictions==1)
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
                 18                   0                   0                   0 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
                  0                   0                   3                   1 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
                 12                   1                   0                   1 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
                  0                   0                  80                  72 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
                174                   0                   3                   0 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
                 18                   3                   0                  10 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
                  3                   0                   0                   0 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
                  0                  10                 591                   0 

Upvotes: 2

Related Questions