Reputation: 1700
I am reading through predict() in R and am confused:
There is a dataset, Spam, from which we have created training and test data by random sampling. We used trainSpam (the training set) to fit the model. We want to see how good the model is by testing it on the test set (testSpam).
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])
predictedSpam[predictionModel$fitted > 0.5] = "spam" #Here is my problem
table(predictedSpam, testSpam$type)
In the line where we say:
predictedSpam[predictionModel$fitted > 0.5] = "spam"
How does predictionModel$fitted predict spam in the test data? It seems to use the fitted values from the training data, which we then compare with the actual labels of the test data. Can someone explain?
Here is what I understood. In the line:
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
We create a model using the trainSpam data.
In the next line:
predictionTest = predict(predictionModel, testSpam)
We create predictionTest using the same model but the test data.
In the next line:
predictedSpam = rep("nonspam", dim(testSpam)[1])
We create a vector with one "nonspam" entry per row of testSpam.
In the next line:
predictedSpam[predictionModel$fitted > 0.5] = "spam"
We are using predictionModel$fitted, which was fitted on the training data, to decide which rows are classified as spam. Shouldn't we use something like predictionTest instead to identify the spam?
My idea of what it should be is:
> predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
> predictionTest = predict(predictionModel, testSpam,type="response")
> predictedSpam = rep("nonspam", dim(testSpam)[1])
> predictedSpam[predictionTest > 0.5] = "spam"
> table(predictedSpam, testSpam$type)
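To make the concern concrete, here is a minimal sketch on made-up data (the toy data frame, the 120/80 split, and the names trainToy, testToy, model and testProb are assumptions for illustration, not the real Spam data). It shows that the fitted values stored in the model have one entry per training row, while predict() with the test rows returns one entry per test row:

```r
# Synthetic stand-in for the Spam data; every name and value here is invented.
set.seed(1)
n <- 200
toy <- data.frame(charDollar = runif(n))
toy$numType <- rbinom(n, 1, plogis(-2 + 4 * toy$charDollar))

# Deliberately unequal split so the two lengths are easy to tell apart.
trainIdx <- sample(n, 120)
trainToy <- toy[trainIdx, ]
testToy  <- toy[-trainIdx, ]

model <- glm(numType ~ charDollar, family = "binomial", data = trainToy)

# model$fitted partially matches model$fitted.values: one value per TRAINING row.
length(model$fitted.values)

# Predictions for the held-out rows: one value per TEST row.
testProb <- predict(model, newdata = testToy, type = "response")
length(testProb)
```

Since model$fitted.values is indexed by training rows, its i-th entry has no relation to the i-th row of the test set.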
Upvotes: 2
Views: 4683
Reputation: 263301
I think you want type="response" in the predict call, since the default will otherwise be the linear predictor.
?predict.glm # different than ?predict
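As a rough illustration of the difference between the two scales (toy data and the names toy, model, newRows, linkScale and probScale are assumptions, not from the question): for a binomial glm with the default logit link, the default predict() output is on the log-odds scale, and applying plogis() to it should give the same numbers as type="response".

```r
# Invented data for illustration only.
set.seed(2)
toy <- data.frame(charDollar = runif(100))
toy$numType <- rbinom(100, 1, plogis(-1 + 3 * toy$charDollar))
model <- glm(numType ~ charDollar, family = "binomial", data = toy)

newRows <- data.frame(charDollar = c(0.1, 0.5, 0.9))

linkScale <- predict(model, newdata = newRows)                     # default type = "link" (log-odds)
probScale <- predict(model, newdata = newRows, type = "response")  # probabilities in [0, 1]

# The inverse logit of the linear predictor reproduces the probabilities.
all.equal(unname(plogis(linkScale)), unname(probScale))

# A 0.5 cut-off on the probability scale corresponds to a 0 cut-off on the log-odds scale.
all((linkScale > 0) == (probScale > 0.5))
```

So comparing the default output with 0.5 applies a different threshold than intended; either ask for type="response" or compare the linear predictor with 0.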
(This is, of course, if I am correctly intuiting your unstated goal to be finding cases in your test set with probabilities greater than 0.5. Furthermore, if you are really getting predictions based on training data, it means your test dataframe was malformed, and you need to edit your question to contain output from both str(trainSpam) and str(testSpam) so we can show you how to properly create a newdata argument for predict.)
After update: So it looks like charDollar is in both the test and training sets, so you should not be getting predictions in predictionTest from the training set. You should get the cases with a predicted spam probability above 50% with:
testSpam[ predict(predictionModel, newdata=testSpam, type="response") > .5, ]
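One pitfall worth checking here: predict() for a glm takes the new rows through an argument named newdata. An argument named data is not matched, so it is silently ignored and the fitted values for the training rows come back instead. A small sketch on invented data (toy, trainToy, testToy, model, wrong, right and flagged are all hypothetical names):

```r
# Invented data for illustration only.
set.seed(3)
n <- 150
toy <- data.frame(charDollar = runif(n))
toy$numType <- rbinom(n, 1, plogis(-2 + 4 * toy$charDollar))
trainToy <- toy[1:100, ]
testToy  <- toy[101:150, ]

model <- glm(numType ~ charDollar, family = "binomial", data = trainToy)

# `data` is swallowed by `...`, so this returns the TRAINING fitted values.
wrong <- predict(model, data = testToy, type = "response")
# `newdata` is the argument that actually supplies new rows.
right <- predict(model, newdata = testToy, type = "response")

length(wrong)  # 100, the number of training rows
length(right)  # 50, the number of test rows

# The trailing comma selects rows of the data frame; without it, the logical
# vector would be treated as a column selector.
flagged <- testToy[right > 0.5, ]
```

Comparing length() of the prediction vector with nrow() of the test set is a cheap sanity check that the predictions really refer to the test rows.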
I'm not sure what code was used to create predictionTest and wonder if you meant to type predictedSpam. This is what I thought would succeed:
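predictedSpam = predict(predictionModel, testSpam, type = "response")  # probabilities for the test rows
spam <- predictedSpam[ predictedSpam > 0.5 ]  # predict() returns a plain numeric vector, so there is no $fitted component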
Upvotes: 1