Reputation: 21
I have a model, called predictive_fit <- fit(workflow, training)
that classifies the Iris dataset species using xgboost. The data are pivoted wide such that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code which then takes the dataset after the model has been fit to test if it can accurately predict the Virginia species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if when I run the new data the same predictions are just being copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect that all 100 times the model would predict exactly the same thing, but I'd like some way to have the predictions run n number of times so each and every row of new data can have its own proportion calculated.
I have already tried using the replicate() function to try this. However, it appears as if it copies the same exact results 100 times. I considered having a for loop that iterated through a different seed and then ran the predictions, but I was hoping for a more performant solution out there.
Upvotes: 0
Views: 110
Reputation: 886
You are replicating the prediction of you model, not the data.frame you call new_iris_data
, and the result is exactly that. In order to replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
The you can use the new_test
in any prediction, independent of the model.
If you want 100 differents random parts of the data set, you need to drop the replicate function and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.
Upvotes: 1