Rehaan11

Reputation: 19

Error in `$<-.data.frame`(`*tmp*`, Predict, value = c(`1` = 1L, `2` = 1L, : replacement has 3500 rows, data has 1500

I have run a random forest with tuning and added the predictions to the Train data, which ran without any issues. However, when I try to run the random forest model on the Test dataset I get the above error. Any idea what could be causing this? My code is below. The full dataset has 5000 rows, so the Train dataset has 3500 rows and the Test dataset has 1500 rows. I appreciate any help with this.

R code:

####Clearing the global environment
rm(list = ls())

##Setting the working directory
setwd("D:/Great Learning/Module 3 -Machine Learning/Project")


##Packages required to be loaded
install.packages("DataExplorer")
install.packages("xlsx")
##install.packages("magrittr")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("mice")
install.packages("NbClust")

##Reading in the dataset
library(xlsx)
LoanModelRaw = read.xlsx("Thera Bank_Personal_Loan_Modelling-dataset- 1.xlsx",sheetName = "Bank_Personal_Loan_Modelling",header = T)
##LoanModelRaw = read.csv("Thera Bank_Personal_Loan_Modelling-dataset-1.csv", sep = ";",header = T)

##Viewing the dataset in R
View(LoanModelRaw)
dim(LoanModelRaw)
colnames(LoanModelRaw)
str(LoanModelRaw)
summary(LoanModelRaw)
nrow(LoanModelRaw)
attach(LoanModelRaw)

#Correcting column names
names(LoanModelRaw)[2] = "AgeInYears" 
names(LoanModelRaw)[3] = "ExperienceInYears"
names(LoanModelRaw)[4] = "IncomeInKMonth"
names(LoanModelRaw)[5] = "ZIPCode"
names(LoanModelRaw)[6] = "FamilyMembers"
names(LoanModelRaw)[10] = "PersonalLoan"
names(LoanModelRaw)[11] = "SecuritiesAccount"
names(LoanModelRaw)[12] = "CDAccount" 

colnames(LoanModelRaw)

#############################################################1 EDA of the data#######################################################

library(DataExplorer)
##introduce(LoanModelRaw)
plot_intro(LoanModelRaw)
plot_missing(LoanModelRaw)
##plot_bar(LoanModelRaw)
plot_histogram(LoanModelRaw)
create_report(LoanModelRaw)

?plot_boxplot

#Missing Value Treatment
library(mice)
sum(is.na(LoanModelRaw))
md.pattern(LoanModelRaw)
LoanModelRawImpute = mice(LoanModelRaw, m =5, method = 'pmm', seed = 1000)
LoanModelRawNoNa = complete(LoanModelRawImpute, 3)
md.pattern(LoanModelRawNoNa)

#Correcting negative experience
LoanModel = abs(LoanModelRawNoNa[2:14])
attach(LoanModel)
#View(LoanModel)
#summary(LoanModel)
#nrow(LoanModel)

#################################################################################
##sample.split() comes from the caTools package
library(caTools)
LoanModel$Split = sample.split(LoanModel$PersonalLoan, SplitRatio = 0.7)
View(LoanModel)
LoanModelTrainRaw = subset(LoanModel,LoanModel$Split == TRUE)
LoanModelTestRaw = subset(LoanModel,LoanModel$Split == FALSE)

#Installing the packages for the running random forest
install.packages("randomForest")
install.packages("dplyr")
library(randomForest)
library(dplyr)
attach(LoanModelTrain)
str(LoanModelTrain)

#Need to exclude the split and move columns
LoanModelTrain = LoanModelTrainRaw[1:13]
LoanModelTest = LoanModelTestRaw[1:13]
LoanModelTrain = LoanModelTrain %>% select(IncomeInKMonth,Mortgage,ZIPCode,CCAvg,everything())
LoanModelTest = LoanModelTest %>% select(IncomeInKMonth,Mortgage,ZIPCode,CCAvg, everything())
head(LoanModelTrain)
head(LoanModelTest)

###Converting the data set to a factor variable in order to be read
#Train
fcol = c(5:13)
LoanModelTrain[,fcol] = lapply(LoanModelTrain[,fcol], factor)
str(LoanModelTrain)
nrow(LoanModelTrain)

#Test
fcol = c(5:13)
LoanModelTest[,fcol] = lapply(LoanModelTest[,fcol], factor)
str(LoanModelTest)

##Running the random forest
seed = 1000
set.seed(seed)
LoanModelTrainRF = randomForest(PersonalLoan ~ ., data = LoanModelTrain, ntree = 501, mtry = 10, nodesize = 10, importance = TRUE, do.trace = TRUE)
print(LoanModelTrainRF)
plot(LoanModelTrainRF)
importance(LoanModelTrainRF)
?randomForest

###Tuning the random Forest
set.seed(seed)
LoanModelTrain = LoanModelTrain %>% select(PersonalLoan,everything())
str(LoanModelTrain)
LoanModelTrainRFTuned = tuneRF(x = LoanModelTrain[,-c(1)], 
                               y = PersonalLoan,
                               mtryStart = 10,
                               stepFactor = 1.5,
                               improve = 0.001,
                               trace = TRUE,
                               plot = TRUE,
                               doBest = TRUE,
                               importance = TRUE)

###Running refined random forest
LoanModelTrainRefinedRF = randomForest(PersonalLoan ~ ., data = LoanModelTrain, ntree = 95, mtry = 10, nodesize = 10, importance = TRUE, do.trace = TRUE)
print(LoanModelTrainRefinedRF)
plot(LoanModelTrainRefinedRF)


###Adding the prediction columns and probability columns 
LoanModelTrain$Predict = predict(LoanModelTrainRefinedRF,data= LoanModelTrain, type = "class")
LoanModelTrain$Score = predict(LoanModelTrainRefinedRF,data= LoanModelTrain, type = "prob")
head(LoanModelTrain)

###Check the accuracy of the model
install.packages("caret")
library(caret)

caret::confusionMatrix(LoanModelTrain$PersonalLoan, LoanModelTrain$Predict)


###Run the model against the Test Data
str(LoanModelTest)

##The next line is the one that raises the error
LoanModelTest$Predict = predict(LoanModelTrainRefinedRF,data= LoanModelTest, type = "class")
LoanModelTest$Score = predict(LoanModelTrainRefinedRF,data= LoanModelTest, type = "prob")

AgeInYears  ExperienceInYears   IncomeInKMonth  ZIPCode FamilyMembers   CCAvg   Education
25  1   49  91107   4   1.6 1
45  19  34  90089   3   1.5 1
39  15  11  94720   1   1.0 1
35  9   100 94112   1   2.7 2
35  8   45  91330   4   1.0 2
37  13  29  92121   4   0.4 2

Mortgage    PersonalLoan    SecuritiesAccount   CDAccount   Online  CreditCard  Split
0   0   1   0   0   0   FALSE
0   0   1   0   0   0   FALSE
0   0   0   0   0   0   TRUE
0   0   0   0   0   0   TRUE
0   0   0   0   0   1   TRUE
155 0   0   0   1   0   TRUE

Upvotes: 0

Views: 2142

Answers (2)

MacOS

Reputation: 1159

This error means that you are trying to assign a column vector of length 3500 to a data frame that has 1500 rows. This does not work because R does not automatically truncate the vector or pad with NA to make the lengths match (and that is a good thing).
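To see the message in isolation, here is a minimal sketch in base R. The data frame `df` and its columns are made up for illustration and have nothing to do with the loan data.

```r
# A data frame with 3 rows (hypothetical example data)
df <- data.frame(a = 1:3)

# Assigning 7 values to a 3-row data frame fails; capture the message
# instead of stopping the script.
msg <- tryCatch({
  df$b <- 1:7
  "no error"
}, error = function(e) conditionMessage(e))

msg
# "replacement has 7 rows, data has 3"
```

The numbers in the message are the length of the vector being assigned and the number of rows of the data frame, in that order.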

Try to check the dimensions (number of rows and number of columns) of LoanModelTest and LoanModelTrain. Also, check the return dimensions of the predict functions.
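One likely cause worth checking: the argument that `predict()` methods use for new observations is `newdata`, not `data`. An unknown `data =` argument is silently absorbed by `...`, and the method then returns predictions for the training rows (for `predict.randomForest`, the out-of-bag predictions). The sketch below shows the effect with a base R `glm` so that it runs without extra packages. The toy data and all object names are assumptions for illustration, not the asker's objects.

```r
# Toy stand-in for the 3500/1500 split; all names here are hypothetical.
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- rbinom(50, 1, plogis(toy$x))
train <- toy[1:35, ]
test  <- toy[36:50, ]

fit <- glm(y ~ x, data = train, family = binomial)

# `data` is not an argument of predict(); it is absorbed by `...`,
# so the fitted values for the 35 training rows come back.
wrong <- predict(fit, data = test, type = "response")

# `newdata` is the documented argument; one prediction per test row.
right <- predict(fit, newdata = test, type = "response")

# Compare row counts with the lengths of the returned vectors.
c(nrow(train), nrow(test), length(wrong), length(right))

# Only the vector whose length matches nrow(test) can be assigned.
test$Score <- right
```

If the same thing is happening in the question, replacing `data= LoanModelTest` with `newdata = LoanModelTest` in both `predict()` calls on the Test data should return 1500 predictions.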

Upvotes: 0

user15517788

Reputation: 11

I got the same error trying to predict a single outcome from a simple glm model. In the model I specified the outcome and predictors using the format "dataset$outcome", etc. In the "test" set (really just one row of observations), I named the columns "outcome", etc. If I remove the $s from the model and instead specify "data=dataset", then the error disappears. So perhaps it's an issue with how objects are being called.
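A small sketch of what I think was going on, using made-up data (`dataset`, `pred`, `outcome` and `new_obs` are hypothetical names). When the formula contains `dataset$pred`, the model's variable is literally called `dataset$pred`, so `predict()` does not find it in `newdata` and evaluates it against the original `dataset` instead.

```r
set.seed(42)
dataset <- data.frame(pred = rnorm(40))
dataset$outcome <- rbinom(40, 1, plogis(dataset$pred))
new_obs <- data.frame(pred = 0.3)   # a single new observation

# Formula written with `$`: predict() falls back to the 40 training rows
# (R also warns that 'newdata' had 1 row; suppressed here for brevity).
fit_dollar <- glm(dataset$outcome ~ dataset$pred, family = binomial)
p_dollar <- suppressWarnings(
  predict(fit_dollar, newdata = new_obs, type = "response")
)
length(p_dollar)   # 40

# Formula with bare column names plus `data =`: one prediction comes back.
fit_data <- glm(outcome ~ pred, data = dataset, family = binomial)
p_data <- predict(fit_data, newdata = new_obs, type = "response")
length(p_data)     # 1
```

Assigning `p_dollar` to a column of `new_obs` would then fail with the "replacement has 40 rows, data has 1" form of the error, while `p_data` fits.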

Upvotes: 1
