Reputation: 19
I tuned a Random Forest and added its predictions to the Train data, which ran without any issues. However, when I tried running the random forest model on the Test dataset, I got the above error. Any idea what could be causing this? My code is below. I'd appreciate any help with this. The Train dataset has 3500 rows and the Test dataset has 1500 rows, since the full dataset has 5000 rows.
R code:
####Clearing the global environment
rm(list = ls())
##Setting the working directory
setwd("D:/Great Learning/Module 3 -Machine Learning/Project")
##Packages required to be loaded
install.packages("DataExplorer")
install.packages("xlsx")
##install.packages("magrittr")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("mice")
install.packages("NbClust")
##Reading in the dataset
library(xlsx)
LoanModelRaw = read.xlsx("Thera Bank_Personal_Loan_Modelling-dataset- 1.xlsx",sheetName = "Bank_Personal_Loan_Modelling",header = T)
##LoanModelRaw = read.csv("Thera Bank_Personal_Loan_Modelling-dataset-1.csv", sep = ";",header = T)
##Viewing the dataset in R
View(LoanModelRaw)
dim(LoanModelRaw)
colnames(LoanModelRaw)
str(LoanModelRaw)
summary(LoanModelRaw)
nrow(LoanModelRaw)
attach(LoanModelRaw)
#Correcting column names
names(LoanModelRaw)[2] = "AgeInYears"
names(LoanModelRaw)[3] = "ExperienceInYears"
names(LoanModelRaw)[4] = "IncomeInKMonth"
names(LoanModelRaw)[5] = "ZIPCode"
names(LoanModelRaw)[6] = "FamilyMembers"
names(LoanModelRaw)[10] = "PersonalLoan"
names(LoanModelRaw)[11] = "SecuritiesAccount"
names(LoanModelRaw)[12] = "CDAccount"
colnames(LoanModelRaw)
#############################################################1 EDA of the data#######################################################
library(DataExplorer)
##introduce(LoanModelRaw)
plot_intro(LoanModelRaw)
plot_missing(LoanModelRaw)
##plot_bar(LoanModelRaw)
plot_histogram(LoanModelRaw)
create_report(LoanModelRaw)
?plot_boxplot
#Missing Value Treatment
library(mice)
sum(is.na(LoanModelRaw))
md.pattern(LoanModelRaw)
LoanModelRawImpute = mice(LoanModelRaw, m =5, method = 'pmm', seed = 1000)
LoanModelRawNoNa = complete(LoanModelRawImpute, 3)
md.pattern(LoanModelRawNoNa)
#Correcting negative experience
LoanModel = abs(LoanModelRawNoNa[2:14])
attach(LoanModel)
#View(LoanModel)
#summary(LoanModel)
#nrow(LoanModel)
#################################################################################
library(caTools)  # provides sample.split()
LoanModel$Split = sample.split(LoanModel$PersonalLoan, SplitRatio = 0.7)
View(LoanModel)
LoanModelTrainRaw = subset(LoanModel,LoanModel$Split == TRUE)
LoanModelTestRaw = subset(LoanModel,LoanModel$Split == FALSE)
#Installing the packages for the running random forest
install.packages("randomForest")
install.packages("dplyr")
library(randomForest)
library(dplyr)
attach(LoanModelTrain)
str(LoanModelTrain)
#Need to exclude the split and move columns
LoanModelTrain = LoanModelTrainRaw[1:13]
LoanModelTest = LoanModelTestRaw[1:13]
LoanModelTrain = LoanModelTrain %>% select(IncomeInKMonth,Mortgage,ZIPCode,CCAvg,everything())
LoanModelTest = LoanModelTest %>% select(IncomeInKMonth,Mortgage,ZIPCode,CCAvg, everything())
head(LoanModelTrain)
head(LoanModelTest)
###Converting the data set to a factor variable in order to be read
#Train
fcol = c(5:13)
LoanModelTrain[,fcol] = lapply(LoanModelTrain[,fcol], factor)
str(LoanModelTrain)
nrow(LoanModelTrain)
#Test
fcol = c(5:13)
LoanModelTest[,fcol] = lapply(LoanModelTest[,fcol], factor)
str(LoanModelTest)
##Running the random forest
seed = 1000
set.seed(seed)
LoanModelTrainRF = randomForest(PersonalLoan ~ ., data = LoanModelTrain, ntree = 501, mtry = 10, nodesize = 10, importance = TRUE, do.trace = TRUE)
print(LoanModelTrainRF)
plot(LoanModelTrainRF)
importance(LoanModelTrainRF)
?randomForest
###Tuning the random Forest
set.seed(seed)
LoanModelTrain = LoanModelTrain %>% select(PersonalLoan,everything())
str(LoanModelTrain)
LoanModelTrainRFTuned = tuneRF(x = LoanModelTrain[,-c(1)],
                               y = PersonalLoan,
                               mtryStart = 10,
                               stepFactor = 1.5,
                               improve = 0.001,
                               trace = TRUE,
                               plot = TRUE,
                               doBest = TRUE,
                               importance = TRUE)
###Running refined random forest
LoanModelTrainRefinedRF = randomForest(PersonalLoan ~ ., data = LoanModelTrain, ntree = 95, mtry = 10, nodesize = 10, importance = TRUE, do.trace = TRUE)
print(LoanModelTrainRefinedRF)
plot(LoanModelTrainRefinedRF)
###Adding the prediction columns and probability columns
LoanModelTrain$Predict = predict(LoanModelTrainRefinedRF,data= LoanModelTrain, type = "class")
LoanModelTrain$Score = predict(LoanModelTrainRefinedRF,data= LoanModelTrain, type = "prob")
head(LoanModelTrain)
###Check the accuracy of the model
install.packages("caret")
library(caret)
caret::confusionMatrix(LoanModelTrain$PersonalLoan, LoanModelTrain$Predict)
###Run the model against the Test Data
str(LoanModelTest)
LoanModelTest$Predict = predict(LoanModelTrainRefinedRF,data= LoanModelTest, type = "class")  # <- line emphasized in the original post
LoanModelTest$Score = predict(LoanModelTrainRefinedRF,data= LoanModelTest, type = "prob")
AgeInYears ExperienceInYears IncomeInKMonth ZIPCode FamilyMembers CCAvg Education
25 1 49 91107 4 1.6 1
45 19 34 90089 3 1.5 1
39 15 11 94720 1 1.0 1
35 9 100 94112 1 2.7 2
35 8 45 91330 4 1.0 2
37 13 29 92121 4 0.4 2
Mortgage PersonalLoan SecuritiesAccount CDAccount Online CreditCard Split
0 0 1 0 0 0 FALSE
0 0 1 0 0 0 FALSE
0 0 0 0 0 0 TRUE
0 0 0 0 0 0 TRUE
0 0 0 0 0 1 TRUE
155 0 0 0 1 0 TRUE
Upvotes: 0
Views: 2142
Reputation: 1159
This error means that you are trying to append a column vector of length 3500 to a data frame that has 1500 rows. Of course, this does not work, because R does not automatically truncate the vector or pad it with NA to make the lengths match (and that is a good thing).
Check the dimensions (number of rows and number of columns) of LoanModelTest and LoanModelTrain. Also check the dimensions of what the predict functions return.
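One likely cause, offered as an assumption rather than something confirmed in this thread: the predict calls in the question pass the test set as data =, while R's predict methods name that argument newdata. An unrecognized data = is silently absorbed by the ... argument, so predict falls back to the training rows. The sketch below reproduces that mismatch with base R's lm on the built-in mtcars data. The 22/10 split and the names train, test and fit are illustrative and not taken from the question.

```r
# Illustrative split of a built-in dataset; not the asker's data.
train <- mtcars[1:22, ]
test  <- mtcars[23:32, ]

fit <- lm(mpg ~ wt + hp, data = train)

# 'data' is not an argument of predict.lm; it is swallowed by '...',
# so the fitted values for the training rows come back.
wrong <- predict(fit, data = test)
length(wrong)   # equals nrow(train), not nrow(test)

# Assigning that vector to the shorter test set fails with a
# "replacement has ... rows, data has ..." error.
msg <- tryCatch({
  test$Predict <- wrong
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# 'newdata' is the argument predict methods actually use.
right <- predict(fit, newdata = test)
length(right)   # equals nrow(test)
test$Predict <- right
```

Applied to the question's code, the corresponding change would be predict(LoanModelTrainRefinedRF, newdata = LoanModelTest, type = "class"), since predict.randomForest also takes newdata as its second argument.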
Upvotes: 0
Reputation: 11
I got the same error when trying to predict a single outcome from a simple glm model. In the model, I had specified the outcome and predictors using the format "dataset$outcome", etc. In the "test" set (really just one row of observations), I named the columns "outcome", etc. If I remove the $s from the model and instead specify "data=dataset", the error disappears. So perhaps it's an issue with how objects are being called.
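A minimal sketch of the behaviour described above, using the built-in mtcars data in place of the poster's dataset (the names dataset, new_obs, fit_dollar and fit_data are illustrative). When the formula is written with $, the terms refer to the original object, so predict cannot find them in newdata and returns one value per training row instead.

```r
# Built-in data used purely for illustration.
dataset <- mtcars
new_obs <- data.frame(wt = 3)  # a single new observation

# Formula written with '$': the terms are tied to the 'dataset' object,
# so 'newdata' is effectively ignored (R emits a warning about this).
fit_dollar <- glm(dataset$am ~ dataset$wt, family = binomial)
p_dollar <- suppressWarnings(
  predict(fit_dollar, newdata = new_obs, type = "response")
)
length(p_dollar)   # one value per training row, not per row of new_obs

# Bare column names plus 'data =': variables are looked up in 'newdata'.
fit_data <- glm(am ~ wt, data = dataset, family = binomial)
p_data <- predict(fit_data, newdata = new_obs, type = "response")
length(p_data)     # one value, matching nrow(new_obs)
```

Assigning p_dollar to a column of new_obs would then raise the same kind of row-count error, while p_data fits.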
Upvotes: 1