Little

Reputation: 3477

problem with decision tree applied to dataset

I was trying to program a decision tree using R and decided to use the car dataset from UCI, available here.

According to the authors it has 7 attributes which are:

CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

So I want to use a decision tree as a classifier to predict the car acceptability from the buying price, maint, comfort, doors, persons, lug_boot and safety.

First of all I extracted the first column as the dependent variable, and then I noticed that the data was arranged in order depending on the value of the first column (very high, high, medium, low). For this reason, I decided to shuffle the data. My code is the following:

car_data<-read.csv("car.data")
library(C50)
set.seed(12345)
car_data_rand<-car_data[order(runif(1727)),]
car_data<-car_data_rand
car_data_train<-car_data[1:1500,]
car_data_test<-car_data[1501:1727,]
answer<-data_train$vhigh
answer_test<-data_test$vhigh
#deleting the dependent variable or y from the data
car_data_train$vhigh<-NULL
car_data_test$vhigh<-NULL
car_model<-C5.0(car_data_train,answer)
summary(car_model)

Here I get an awful error rate:

Evaluation on training data (1500 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         7  967(64.5%)   <<

What am I doing wrong?

Upvotes: 0

Views: 52

Answers (1)

Julius Vainora

Reputation: 48211

  1. In the middle of your code you have data_train and data_test rather than car_data_train and car_data_test.

  2. While the error is high, there is nothing wrong with it. Note that

1 - table(answer) / length(answer)
# answer
#      high       low       med     vhigh 
# 0.7466667 0.7566667 0.7426667 0.7540000 

That means that if you naively always guessed "low", your error would be 75.6%. So there is an improvement of roughly 11 percentage points. The fact that the improvement is somewhat small suggests that the predictors are not great.

  3. Lastly, there is an inconsistency: you say that you want to model the car acceptability, while your code models the buying variable. Fixing that leads to just 1.1% error. However, in this case your sample is very imbalanced:

1 - table(answer) / length(answer)
# answer
#       acc      good     unacc     vgood 
# 0.7773333 0.9600000 0.3020000 0.9606667 

That is, by always guessing unacc you could already get just 30.2% error. The improvement of 29.1 percentage points, however, is clearly larger.
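For completeness, here is a rough sketch of the corrected workflow. It assumes car.data has no header row, so column names are supplied by hand (the names below are my own labels, with the acceptability class as the last column):

library(C50)

# car.data has no header row; supply column names by hand
# (these labels are assumptions; the last column is the acceptability class)
cols <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")
car_data <- read.csv("car.data", header = FALSE, col.names = cols,
                     stringsAsFactors = TRUE)

# shuffle the rows, then split into training and test sets
set.seed(12345)
car_data <- car_data[sample(nrow(car_data)), ]
car_data_train <- car_data[1:1500, ]
car_data_test  <- car_data[1501:nrow(car_data), ]

# model the acceptability (class), not the buying price
car_model <- C5.0(class ~ ., data = car_data_train)
summary(car_model)

# test-set error of the tree vs. always guessing the majority class
pred <- predict(car_model, car_data_test)
mean(pred != car_data_test$class)
1 - max(table(car_data_test$class)) / nrow(car_data_test)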

Upvotes: 2
