ds_user

Reputation: 2179

Decision tree model running for long time

I am running my decision tree model using rpart package in R. Here is what I am doing,

  1. Load the data using read.csv
  2. Remove unwanted columns
  3. Split the dataset into training and test sets
  4. Fit the model on the training set -- this step has been running for a whole day.
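For reference, the steps above can be sketched as follows. The CSV here is a synthetic stand-in (the question does not include the actual file or dropped column names), so the data and the `junk` column are placeholders:

```r
library(rpart)

# Synthetic stand-in for the asker's CSV so the sketch is runnable.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(y    = sample(c("a", "b"), 200, replace = TRUE),
                     x1   = sample(letters[1:5], 200, replace = TRUE),
                     x2   = runif(200),
                     junk = 1:200),
          tmp, row.names = FALSE)

df <- read.csv(tmp, stringsAsFactors = TRUE)  # 1. load the data
df$junk <- NULL                               # 2. remove unwanted columns

set.seed(42)                                  # 3. split into training/test
idx <- sample(nrow(df), 0.7 * nrow(df))
trainingset <- df[idx, ]
testset     <- df[-idx, ]

fit <- rpart(y ~ ., data = trainingset, method = "class")  # 4. fit
```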

Here is the summary of my dataset.

'data.frame':   117919 obs. of  7 variables:
 $ Database          : Factor w/ 2 levels "DBIL","DBPD": 1 1 1 1 1 1 1 1 1 1 ...
 $ Market_Description: Factor w/ 1 level "MY (PM)": 1 1 1 1 1 1 1 1 1 1 ...
 $ Manufacturer      : Factor w/ 21 levels "21 Century","Abbott Lab",..: 4 3 4 4 4 4 3 3 3 3 ...
 $ Brand             : Factor w/ 133 levels "","21 Century",..: 34 26 34 34 34 34 26 26 26 26 ...
 $ Sub_Brand         : Factor w/ 194 levels "","0-6 Bulan",..: 9 6 9 9 9 9 6 6 6 6 ...
 $ Age_Group         : Factor w/ 5 levels "","Adultenr",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ FMT_Category      : Factor w/ 10 levels "Adult Powders (excl Super Bev)",..: 5 5 5 5 5 5 5 5 5 5 ...

Here is my model-fitting call:

fit <- rpart(FMT_Category~Database+Market_Description+Manufacturer+Brand+Sub_Brand+Age_Group, data=trainingset)

It has only 117919 observations. I checked memory.limit in R and it says 8065, and mem_used says 40 MB. I am not getting any error, but the model keeps running for a day, so I am not sure what to check here. I expected R to give me at least some crappy tree that I could start from. I thought it had something to do with the factors, so I read the data with stringsAsFactors=FALSE, but it still runs forever. I tried the same data in my Python script and in Weka, and both run fast without any error. Please let me know what I am missing, or point me in the right direction on what I should be checking.

Edit -- I just noticed that the issue is the number of levels in Brand and Sub_Brand; the recursive partitioning has to search over the possible groupings of all those levels, which makes the model run forever. Any suggestions on how to handle this?
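One common workaround for the high-cardinality problem described in the edit is to collapse rare factor levels into a single "Other" bucket before fitting, which shrinks the split search space dramatically. This is a sketch, not something from the question; `lump_rare` is a helper written here for illustration (the forcats package's fct_lump offers similar functionality):

```r
# Collapse levels of a factor that cover less than `min_frac` of the
# rows into a single "Other" level. Helper for this sketch only.
lump_rare <- function(f, min_frac = 0.01) {
  f <- as.factor(f)
  freq <- table(f) / length(f)
  rare <- names(freq)[freq < min_frac]
  levels(f)[levels(f) %in% rare] <- "Other"  # duplicate names merge levels
  droplevels(f)
}

# Toy factor: one dominant level plus five rare ones.
x <- factor(c(rep("common", 95), paste0("rare", 1:5)))
table(lump_rare(x, min_frac = 0.02))  # common: 95, Other: 5
```

Applied to Brand and Sub_Brand, this could cut the 133 and 194 levels down to whatever subset actually carries enough data to support a split.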

Upvotes: 4

Views: 3051

Answers (3)

boyaronur

Reputation: 551

I don't know much about your dataset, but it would be useful to look at your factor variables and try to reduce the number of factor levels. Are 194 levels in Sub_Brand really necessary? How many observations are assigned to each of those levels?
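To act on this advice, you can tabulate the level counts and see how thinly the data is spread. The factor below is a synthetic stand-in for Sub_Brand (a few large levels and many tiny ones), since the actual data is not available:

```r
set.seed(1)
# Toy stand-in for Sub_Brand: 5 big levels, 45 small ones.
sub_brand <- factor(sample(paste0("sb", 1:50), 5000, replace = TRUE,
                           prob = c(rep(0.15, 5), rep(0.25 / 45, 45))))

counts <- sort(table(sub_brand), decreasing = TRUE)
head(counts, 5)   # the most populated levels
sum(counts < 20)  # how many levels are thinly populated
```

Levels that barely appear contribute little signal but still inflate the split search.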

Also, since PCA is not a good approach for categorical data in most cases, you can try one-hot encoding instead, as recommended in https://www.quora.com/Can-one-use-dimension-reduction-algorithms-like-PCA-for-categorical-variables-Any-suggestions-on-what-techniques-can-be-used/answer/Gaurav-Dangi-1?srid=YgDw.
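In base R, one-hot (dummy) encoding can be done with model.matrix; the tiny data frame below stands in for the asker's factor columns:

```r
# One-hot encode two factors; "- 1" drops the intercept so the first
# factor expands to one column per level (later factors keep k-1
# columns under the default treatment contrasts).
df <- data.frame(Manufacturer = factor(c("A", "B", "A", "C")),
                 Age_Group    = factor(c("x", "x", "y", "y")))
X <- model.matrix(~ . - 1, data = df)
dim(X)  # 4 rows, 4 columns (3 Manufacturer dummies + 1 Age_Group dummy)
```

Note that with 133 Brand and 194 Sub_Brand levels this produces a very wide matrix, so it is usually combined with the level-reduction step above.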

Upvotes: 0

Yanhui Zhou

Reputation: 898

  1. Use Brand and Sub_Brand as integers.
  2. Get rid of Market_Description, which has only 1 level and contributes nothing to a decision tree model.
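Both suggestions can be sketched on a toy frame whose column names mirror the question's str() output (the values themselves are made up):

```r
df <- data.frame(Brand              = factor(c("b1", "b2", "b1")),
                 Sub_Brand          = factor(c("s1", "s2", "s3")),
                 Market_Description = factor(rep("MY (PM)", 3)))

df$Brand     <- as.integer(df$Brand)  # 1. replace factors with their
df$Sub_Brand <- as.integer(df$Sub_Brand)  #    integer level codes
df$Market_Description <- NULL         # 2. drop the single-level column
str(df)
```

Treating the level codes as numeric lets rpart search simple threshold splits instead of enumerating level subsets, at the cost of imposing an arbitrary ordering on the brands.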

Upvotes: 1

AK47

Reputation: 1328

You can use the H2O package for decision trees, random forests, and neural networks; see h2o.gbm and h2o.randomForest. This package lets you use all of your computer's resources.

You can find an example with:

library(h2o)
conn <- h2o.init()
demo(h2o.randomForest)

Upvotes: 0
