Reputation: 2179
I am running a decision tree model using the rpart package in R. Here is what I am doing.
Here is the summary of my dataset.
'data.frame': 117919 obs. of 7 variables:
$ Database : Factor w/ 2 levels "DBIL","DBPD": 1 1 1 1 1 1 1 1 1 1 ...
$ Market_Description: Factor w/ 1 level "MY (PM)": 1 1 1 1 1 1 1 1 1 1 ...
$ Manufacturer : Factor w/ 21 levels "21 Century","Abbott Lab",..: 4 3 4 4 4 4 3 3 3 3 ...
$ Brand : Factor w/ 133 levels "","21 Century",..: 34 26 34 34 34 34 26 26 26 26 ...
$ Sub_Brand : Factor w/ 194 levels "","0-6 Bulan",..: 9 6 9 9 9 9 6 6 6 6 ...
$ Age_Group : Factor w/ 5 levels "","Adultenr",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FMT_Category : Factor w/ 10 levels "Adult Powders (excl Super Bev)",..: 5 5 5 5 5 5 5 5 5 5 ...
Here is my script for the model:
fit <- rpart(FMT_Category~Database+Market_Description+Manufacturer+Brand+Sub_Brand+Age_Group, data=trainingset)
It has only 117919 observations. I checked memory.limit in my R session and it says 8065, and mem_used says 40 MB. I am not getting any error, but the model keeps running for a day, so I am not sure what to check here. I expected R to give me at least some crappy tree so I could start from there. I thought it had something to do with the factors, so I read the data with stringsAsFactors=FALSE, but it still runs forever. I tried the same data in my Python script and in Weka, and it runs fast without any error. Please let me know what I am missing, or point me in the right direction on what I should be checking.
Edit -- I just noticed that the issue is the number of levels in Brand and Sub_Brand; this makes the model run forever, since rpart has to search over the possible groupings of levels at each split. Any suggestions on how to handle this?
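One common workaround is to collapse rare factor levels into a single "Other" bucket before fitting, which drastically shrinks the split search space. A minimal sketch, assuming the trainingset data frame and column names from the question (the helper collapse_rare and the threshold min_n are illustrative, not part of the original code):

```r
library(rpart)

# Collapse levels that occur fewer than `min_n` times into "Other".
collapse_rare <- function(x, min_n = 100) {
  counts <- table(x)
  rare <- names(counts)[counts < min_n]
  x <- as.character(x)
  x[x %in% rare] <- "Other"
  factor(x)
}

trainingset$Brand     <- collapse_rare(trainingset$Brand)
trainingset$Sub_Brand <- collapse_rare(trainingset$Sub_Brand)

fit <- rpart(FMT_Category ~ Database + Market_Description + Manufacturer +
               Brand + Sub_Brand + Age_Group, data = trainingset)
```

The forcats package offers fct_lump_min() for the same purpose if you prefer a ready-made function.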
Upvotes: 4
Views: 3051
Reputation: 551
I don't know much about your dataset, but it would be useful to look at your factor variables and try to reduce the number of factor levels. Are 194 levels in Sub_Brand really necessary? How many observations are assigned to each of those levels?
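To see how the observations are distributed across levels, a quick sketch (column name taken from the question; often a handful of levels cover most of the data):

```r
# Frequency of each Sub_Brand level, largest first.
level_counts <- sort(table(trainingset$Sub_Brand), decreasing = TRUE)
head(level_counts, 20)

# Proportion of all observations covered by the top 20 levels.
sum(head(level_counts, 20)) / nrow(trainingset)
```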
Also, since PCA is not a good approach for categorical data in most cases, you can try one-hot encoding, as recommended in https://www.quora.com/Can-one-use-dimension-reduction-algorithms-like-PCA-for-categorical-variables-Any-suggestions-on-what-techniques-can-be-used/answer/Gaurav-Dangi-1?srid=YgDw.
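In base R, one-hot (dummy) encoding can be sketched with model.matrix (trainingset and the column name are taken from the question):

```r
# Build a 0/1 indicator matrix for Sub_Brand.
# The "- 1" keeps one column per level instead of dropping a baseline level.
X <- model.matrix(~ Sub_Brand - 1, data = trainingset)
dim(X)  # rows = observations, columns = number of Sub_Brand levels
```

Note that for a tree model this turns one 194-level split search into 194 binary predictors, which rpart handles much more cheaply.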
Upvotes: 0
Reputation: 898
Upvotes: 1
Reputation: 1328
You can use the H2O package for decision trees, random forests and neural networks. See h2o.gbm and h2o.randomForest. This package lets you use all of your computer's resources.
You can find an example here:
library(h2o)
conn <- h2o.init()
demo(h2o.randomForest)
Upvotes: 0