Prediction using rpart on new factor (categorical) variables

Question

I am practising machine learning in R, using the rpart method for training. The data is the Adult data set from the UCI repository:

http://archive.ics.uci.edu/ml/datasets/Adult

#Get the data    
adultData <- read.table("adult.data", header = FALSE, sep = ",")
adultName <- read.csv("adult.name", header = TRUE, sep = ",", stringsAsFactors = FALSE)
names(adultData) <- names(adultName)

To keep the exercise simple, I select only a few attributes and use only 20% of the data for training

selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
adultData <- subset(adultData, select = selected)
library(caret)  # provides createDataPartition() and train()
trainIndex = createDataPartition(adultData$salary, p=0.20, list=FALSE)
training = adultData[ trainIndex, ]

It will take about a minute to fit the model using "rpart" (it is slower with "gbm" or "rf")

set.seed(33833)
modFit <- train(salary ~ ., method = "rpart", data=training)

The problem arises when I predict on a new observation. I create a new data frame

a <- data.frame(age = 40, education = "Bachelors", marital.status = "Divorced", relationship = "Wife", sex = "Female", hours.per.week = 40)
predict(modFit, newdata = a)

It returns an error "education has a new level".

I know that the problem comes from the categorical (factor) variables. Somehow, "Bachelors" is not recognized as an existing level of the factor but is treated as a new level.

user3875022 · Accepted Answer

The problem originates from the poor cleaning of the data

When I downloaded the data, I noticed a problem that is common with factors in R: the labels have a leading space. As a consequence, when you refer to a label (e.g., "Bachelors" in your example), R does not recognize it, because in the factor this level has a leading space:

" Bachelors"

You can see this by printing the levels of the factor: levels(adultData$education)

You can remove this whitespace when reading the file by setting the strip.white argument to TRUE

If you load the dataset in the standard way, you can see that the factor labels have a leading space

# Not Run 
#  adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE)

# levels(adultData$education)

 # [1] " 10th"         " 11th"         " 12th"         " 1st-4th"     
 # [5] " 5th-6th"      " 7th-8th"      " 9th"          " Assoc-acdm"  
 # [9] " Assoc-voc"    " Bachelors"    " Doctorate"    " HS-grad"     
# [13] " Masters"      " Preschool"    " Prof-school"  " Some-college"

If you load the dataset with strip.white = TRUE, you can see that the factor labels have no leading space

# Not Run 
# adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE, strip.white = TRUE)

# levels(adultData$education)

 # [1] "10th"         "11th"         "12th"         "1st-4th"      "5th-6th"     
 # [6] "7th-8th"      "9th"          "Assoc-acdm"   "Assoc-voc"    "Bachelors"   
# [11] "Doctorate"    "HS-grad"      "Masters"      "Preschool"    "Prof-school" 
# [16] "Some-college"
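The same effect can be reproduced without the file. The following is a minimal sketch, assuming a made-up two-row extract in the style of adult.data (the inline text and the names raw, padded and trimmed are illustrative, not part of the original data). It relies only on base R's read.csv and its strip.white argument.

```r
# Two made-up rows in the style of adult.data: a space follows every comma.
raw <- "age,education,salary
39, Bachelors, <=50K
50, HS-grad, >50K"

# stringsAsFactors = TRUE is set explicitly because R >= 4.0 defaults to FALSE.
padded  <- read.csv(text = raw, stringsAsFactors = TRUE)
trimmed <- read.csv(text = raw, stringsAsFactors = TRUE, strip.white = TRUE)

levels(padded$education)   # levels keep the leading space
levels(trimmed$education)  # levels are stripped

# The unpadded label only matches after stripping.
"Bachelors" %in% levels(padded$education)   # FALSE
"Bachelors" %in% levels(trimmed$education)  # TRUE
```

Numeric fields such as age are unaffected either way, because read.csv always strips whitespace around numbers.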

I've reproduced the example by loading the clean dataset, which I've renamed

# Not Run 
# adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE, strip.white = TRUE)

The dataset is too large to be published here; it can easily be reproduced by following the instructions at the link above. My clean dataset can be downloaded from here http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r
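If the data has already been read without strip.white = TRUE, the whitespace can also be removed afterwards. This is a sketch on a toy data frame (df and its values are made up for illustration); it uses base R's trimws on the levels of every factor column.

```r
# Toy stand-in for a data frame that was read without strip.white = TRUE.
df <- data.frame(
  age       = c(39, 50),
  education = factor(c(" Bachelors", " HS-grad")),
  salary    = factor(c(" <=50K", " >50K"))
)

# Trim leading/trailing whitespace from the levels of every factor column;
# non-factor columns (here: age) are left untouched.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], function(f) {
  levels(f) <- trimws(levels(f))
  f
})

levels(df$education)
levels(df$salary)
```

This has to be done before the model is fitted, since the fitted model stores the levels it saw during training.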

Always take a look at the data

dim(adultData)
head(adultData)
str(adultData)

Load the libraries you need

library(rpart)
library(caret)

I've selected the same attributes that you selected and used 40% of the data set for training (which is acceptable for training)

selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
adultData <- subset(adultData, select = selected)
trainIndex = createDataPartition(adultData$salary, p=0.40, list=FALSE)
training = adultData[ trainIndex, ]

I also added a test set

test = adultData[ -trainIndex, ]

Model fitting

set.seed(33833)
modFit <- train(salary ~ ., method = "rpart", data=training)

Overall accuracy

prediction <- predict(modFit, newdata=test)

tab <- table(prediction, test$salary)

sum(diag(tab))/sum(tab)
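To make the accuracy formula concrete, here is a small worked example on made-up labels (truth and prediction below are hypothetical, not output of the model). The diagonal of the table counts the cases where prediction and truth agree, so dividing its sum by the table total gives the share of correct predictions.

```r
# Hypothetical true labels and predictions for six cases.
truth      <- factor(c("<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"))
prediction <- factor(c("<=50K", "<=50K", ">50K",  ">50K", ">50K", "<=50K"),
                     levels = levels(truth))

# Rows are predictions, columns are true labels; agreement is on the diagonal.
tab <- table(prediction, truth)
tab

# Correct predictions divided by all predictions: 4 of 6 here.
accuracy <- sum(diag(tab)) / sum(tab)
accuracy

# Equivalent shortcut when both factors share the same levels.
mean(prediction == truth)
```

Fixing the levels of prediction to those of truth keeps the table square even if one class is never predicted, which the diagonal calculation depends on.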

Better testing with the caret package

rpartPred<-predict(modFit,test)

confusionMatrix(rpartPred,test$salary)

Plot the model (the resulting plot is not very readable)

library(rattle)

fancyRpartPlot(modFit$finalModel)

Alternative

library(partykit)

finalModel <-as.party(modFit$finalModel)
plot(finalModel)

Prediction with the new data value as specified by you

a <- data.frame(age = 40, education = "Bachelors", marital.status = "Divorced", relationship = "Wife", sex = "Female", hours.per.week = 40)

predict(modFit, newdata = a)
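Before predicting, it can help to check whether the new data contains values the training factors have never seen. The helper below, unseen_levels, is a hypothetical function written for illustration (it is not part of caret or rpart), and the toy data frames stand in for training and a.

```r
# Hypothetical helper: for every factor column of `training` that also exists
# in `newdata`, list the values in `newdata` that are not known levels.
unseen_levels <- function(training, newdata) {
  shared <- intersect(names(training), names(newdata))
  out <- list()
  for (col in shared) {
    if (is.factor(training[[col]])) {
      bad <- setdiff(as.character(newdata[[col]]), levels(training[[col]]))
      if (length(bad) > 0) out[[col]] <- bad
    }
  }
  out
}

# Toy training data with padded levels, as in the uncleaned file.
training_toy <- data.frame(education = factor(c(" Bachelors", " HS-grad")),
                           age = c(39, 50))
new_toy <- data.frame(education = "Bachelors", age = 40)

# Reports "Bachelors" for education, because only " Bachelors" is a level.
res <- unseen_levels(training_toy, new_toy)
res
```

An empty list means every categorical value in the new data matches a level seen in training.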
