Reputation: 7719
I am working with a rather large dataset (770K records, 2K attributes; almost all of the attributes are binomial but stored as integers).
I want to apply a decision tree to the data with 10-fold cross-validation, but I have some problems:
1. Why does a decision tree (e.g. with a depth of 10) take so long to train? I balance the data (as it is imbalanced) down to about 40% of the original size (~320K records) before training the tree, but it still takes a lot of time. Is there another version of the decision tree that gives the same performance and takes less time? (Would converting the attributes to binomial type make it faster?)
2. How can I optimize the parameters of the decision tree? Should I optimize them over the whole X-validation?
Upvotes: 0
Views: 2740
Reputation: 126
Do you have a reason for keeping the binary attributes typed as integers? Induction is indeed faster for binomial attributes: a binomial attribute has only one possible split, whereas for an integer (numerical) attribute the tree induction algorithm has to search for the best split value, for each attribute at each node.
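To make the cost difference concrete, here is a minimal sketch in plain Python (not the code of any particular tool) of how a numerical split search typically enumerates candidate thresholds: the midpoints between consecutive distinct values. The function name `candidate_thresholds` and the toy columns are hypothetical. Note that an integer column holding only 0 and 1 also ends up with a single threshold; the extra cost in a real tool would come from sorting and scanning the column to discover that, which a binomial type lets it skip.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct values.

    This mirrors the usual way a numerical split search builds its
    candidate list; it is an illustration, not a specific tool's code.
    """
    distinct = sorted(set(values))  # sorting is part of the per-node cost
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# Hypothetical toy columns: one 0/1 attribute and one general numeric one.
binary_column = [0, 1, 1, 0, 1, 0]
numeric_column = [3, 7, 1, 9, 4, 6]

binary_splits = candidate_thresholds(binary_column)
numeric_splits = candidate_thresholds(numeric_column)

print(len(binary_splits), len(numeric_splits))  # prints "1 5"
```

With 2K attributes, this search is repeated for every attribute at every node, which is why the attribute type can matter for training time.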
How long does it take to induce such a tree? Which algorithm are you using?
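Regarding parameter optimization: it needs to be done on a separate set inside each X-validation loop, i.e. using only the training portion of each fold, so that the fold's test data never influences the chosen parameters. See this workflow as an example of how to do it: http://www.myexperiment.org/workflows/3263.html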
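For readers who cannot open the workflow, the same idea (often called nested cross-validation) can be sketched in Python with scikit-learn. This is a swapped-in illustration under stated assumptions, not the linked workflow: the synthetic data, the parameter grid, and the fold counts are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: small and binarized to 0/1 attributes.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X = (X > 0).astype(np.uint8)

# Inner loop: picks parameters using only the outer training portion.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: the 10-fold cross-validation that estimates performance.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Assumed grid; the values are illustrative, not recommendations.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 10, 50]},
    cv=inner_cv,
    scoring="accuracy",
)

# For each outer fold, the grid search is re-run on that fold's training
# data only, then the tuned tree is scored on the held-out outer fold.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean())
```

Because the search runs again inside every outer fold, the selected parameters may differ from fold to fold; the outer scores estimate the performance of the whole procedure (tuning plus training), not of one fixed parameter setting.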
Upvotes: 1