Reputation: 733
I've trained a tree model with R caret. I'm now trying to generate a confusion matrix and keep getting the following error:
Error in confusionMatrix.default(predictionsTree, testdata$catgeory) : the data and reference factors must have the same number of levels
prob <- 0.5 #Specify class split
singleSplit <- createDataPartition(modellingData2$category, p=prob,
times=1, list=FALSE)
cvControl <- trainControl(method="repeatedcv", number=10, repeats=5)
traindata <- modellingData2[singleSplit,]
testdata <- modellingData2[-singleSplit,]
treeFit <- train(traindata$category~., data=traindata,
trControl=cvControl, method="rpart", tuneLength=10)
predictionsTree <- predict(treeFit, testdata)
confusionMatrix(predictionsTree, testdata$catgeory)
The error occurs when generating the confusion matrix. The levels are the same on both objects. I can't figure out what the problem is. Their structure and levels are given below. They should be the same. Any help would be greatly appreciated as it's making me cracked!!
> str(predictionsTree)
Factor w/ 30 levels "16-Merchant Service Charge",..: 28 22 22 22 22 6 6 6 6 6 ...
> str(testdata$category)
Factor w/ 30 levels "16-Merchant Service Charge",..: 30 30 7 7 7 7 7 30 7 7 ...
> levels(predictionsTree)
[1] "16-Merchant Service Charge" "17-Unpaid Cheque Fee" "18-Gov. Stamp Duty" "Misc" "26-Standard Transfer Charge"
[6] "29-Bank Giro Credit" "3-Cheques Debit" "32-Standing Order - Debit" "33-Inter Branch Payment" "34-International"
[11] "35-Point of Sale" "39-Direct Debits Received" "4-Notified Bank Fees" "40-Cash Lodged" "42-International Receipts"
[16] "46-Direct Debits Paid" "56-Credit Card Receipts" "57-Inter Branch" "58-Unpaid Items" "59-Inter Company Transfers"
[21] "6-Notified Interest Credited" "61-Domestic" "64-Charge Refund" "66-Inter Company Transfers" "67-Suppliers"
[26] "68-Payroll" "69-Domestic" "73-Credit Card Payments" "82-CHAPS Fee" "Uncategorised"
> levels(testdata$category)
[1] "16-Merchant Service Charge" "17-Unpaid Cheque Fee" "18-Gov. Stamp Duty" "Misc" "26-Standard Transfer Charge"
[6] "29-Bank Giro Credit" "3-Cheques Debit" "32-Standing Order - Debit" "33-Inter Branch Payment" "34-International"
[11] "35-Point of Sale" "39-Direct Debits Received" "4-Notified Bank Fees" "40-Cash Lodged" "42-International Receipts"
[16] "46-Direct Debits Paid" "56-Credit Card Receipts" "57-Inter Branch" "58-Unpaid Items" "59-Inter Company Transfers"
[21] "6-Notified Interest Credited" "61-Domestic" "64-Charge Refund" "66-Inter Company Transfers" "67-Suppliers"
[26] "68-Payroll" "69-Domestic" "73-Credit Card Payments" "82-CHAPS Fee" "Uncategorised"
Upvotes: 26
Views: 88934
Reputation: 1
Look at the data type! My issue was that data had type int and reference had num. They need the same type.
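A quick way to act on this is to inspect both arguments right before the call. The sketch below uses toy data and a hypothetical helper, describe_arg(), that is not part of caret. It also shows one thing such a check can surface: in base R, `$` with a column name that does not exist in the data frame returns NULL, which has no levels at all. Note that the error message in the question refers to testdata$catgeory, while the str() output refers to testdata$category.

```r
# Toy stand-ins for the objects in the question (hypothetical data).
testdata <- data.frame(category = factor(c("a", "b", "a", "c")))
predictionsTree <- factor(c("a", "b", "b", "c"),
                          levels = levels(testdata$category))

# Hypothetical helper: summarise what confusionMatrix() would receive.
describe_arg <- function(x) {
  list(class = class(x), length = length(x), n_levels = length(levels(x)))
}

ok  <- describe_arg(testdata$category)  # correctly spelled column
bad <- describe_arg(testdata$catgeory)  # misspelled column: `$` returns NULL

str(ok)   # a factor with 4 values and 3 levels
str(bad)  # class "NULL", length 0, 0 levels
```

If the second summary shows class "NULL" and zero levels, the two arguments cannot have the same number of levels, whatever the predictions look like.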
Upvotes: 0
Reputation: 232
I just ran into the same problem and solved it by using R's ordered factor data type.
levels <- levels(predictionsTree)
levels <- levels[order(levels)]
table(ordered(predictionsTree, levels), ordered(testdata$category, levels))
Upvotes: 0
Reputation: 421
If your data contains NAs, they can sometimes be treated as a factor level, so omit them first:
DF = na.omit(DF)
Then, if your model fit is still predicting an incorrect level, it is better to pass a table:
confusionMatrix(table(Arg1, Arg2))
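As a rough illustration of the NA point, here is a base R sketch on made-up data (the vector x and the columns category and amount are assumptions, not taken from the question). It shows how NA can end up as an extra factor level when exclude = NULL is used, and what na.omit() does to a data frame.

```r
# Made-up values with one missing category.
x <- c("a", "b", NA, "a")

f_default <- factor(x)                  # NA is not counted as a level
f_with_na <- factor(x, exclude = NULL)  # NA is kept as an extra level

nlevels(f_default)  # 2
nlevels(f_with_na)  # 3

# na.omit() drops every row that has an NA in any column.
DF <- data.frame(category = x, amount = c(1, 2, 3, NA))
DF <- na.omit(DF)
nrow(DF)  # 2: the row with NA category and the row with NA amount are gone
```

If one of the two factors passed to confusionMatrix() carries such an extra level and the other does not, the level counts will differ.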
Upvotes: 0
Reputation: 2435
Make sure you installed the package with all its dependencies:
install.packages('caret', dependencies = TRUE)
confusionMatrix( table(prediction, true_value) )
Upvotes: 0
Reputation: 61
Maybe your model is not predicting a certain factor level.
Use the table() function instead of confusionMatrix() to see if that is the problem.
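Here is a minimal base R sketch of that check, using invented vectors named predicted and reference (assumptions for illustration only). A non-square table, or a non-empty setdiff() of the levels, points to a level the model never predicted. Re-creating the predictions with the reference levels is one possible way to line them up.

```r
# Invented example: the model never predicts "c".
reference <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
predicted <- factor(c("a", "b", "b", "a"))  # only has levels "a" and "b"

tab <- table(Prediction = predicted, Reference = reference)
dim(tab)  # 2 3: not square, so the level sets differ

# Which reference levels are missing from the predictions?
missing_levels <- setdiff(levels(reference), levels(predicted))
missing_levels  # "c"

# One possible fix: give the predictions the full set of reference levels.
predicted_fixed <- factor(predicted, levels = levels(reference))
tab_fixed <- table(Prediction = predicted_fixed, Reference = reference)
dim(tab_fixed)  # 3 3
```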
Upvotes: 5
Reputation: 57
Put both into a single data frame and then pass them to the confusionMatrix function:
predicted <- factor(predict(treeFit, testdata))
real <- factor(testdata$category)
my_data1 <- data.frame(data = predicted, type = "prediction")
my_data2 <- data.frame(data = real, type = "real")
my_data3 <- rbind(my_data1,my_data2)
# Check if the levels are identical
identical(levels(my_data3[my_data3$type == "prediction",1]) , levels(my_data3[my_data3$type == "real",1]))
confusionMatrix(my_data3[my_data3$type == "prediction",1], my_data3[my_data3$type == "real",1], dnn = c("Prediction", "Reference"))
Upvotes: 2
Reputation: 241
Try using:
confusionMatrix(table(Argument1, Argument2))
That worked for me.
Upvotes: 24
Reputation: 9175
Try specifying na.pass for the na.action option:
predictionsTree <- predict(treeFit, testdata, na.action = na.pass)
Upvotes: 4
Reputation: 11
I had the same issue and fixed it by dropping the NAs right after reading the data file, like so:
data = na.omit(data)
Thanks all for the pointers!
Upvotes: 1
Reputation: 2939
The length problem you're running into is probably due to the presence of NAs in the training set -- either drop the cases that are not complete, or impute so that you do not have missing values.
Upvotes: 0
Reputation: 11
There might be missing values in testdata. Add the following line before "predictionsTree <- predict(treeFit, testdata)" to remove the NAs. I had the same error and now it works for me.
testdata <- testdata[complete.cases(testdata),]
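To see what that line does, here is a small base R sketch on invented data (the columns amount and category are assumptions). The idea is that if a prediction step drops incomplete rows, the predictions end up shorter than the reference column. Filtering testdata first keeps the two the same length.

```r
# Invented test set with one incomplete row.
testdata <- data.frame(
  amount   = c(10, NA, 30, 40),
  category = factor(c("a", "b", "a", "b"))
)

# complete.cases() returns TRUE for rows with no NA in any column.
complete_rows  <- complete.cases(testdata)
testdata_clean <- testdata[complete_rows, ]

nrow(testdata)        # 4
nrow(testdata_clean)  # 3

# Take the reference labels from the same filtered data frame, so they
# line up row for row with predictions made on testdata_clean.
reference <- testdata_clean$category
length(reference)     # 3
```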
Upvotes: 0