Reputation: 3689
I am using scikit-learn to create a decision tree, and it's working like a charm. I would like to achieve one more thing: to make the tree split on an attribute only once.
The reason behind this is my very strange dataset. It is noisy, and I am really interested in the noise as well. My class outcomes are binary, let's say [+,-]. I have a bunch of attributes with numbers mostly in the range (0,1).
When scikit-learn creates the tree, it splits on attributes multiple times to make the tree "better". I understand that this makes the leaf nodes more pure, but that's not what I want to achieve.
What I did was define a cutoff for every attribute by computing the information gain at different cutoffs and choosing the maximum. With "leave-one-out" and "1/3-2/3" cross-validation techniques I get better results this way than with the original tree.
The problem is that when I try to automate this, I run into trouble near the lower and upper bounds, e.g. around 0 and 1, because most of the elements fall below/above such a cutoff and I get really high information gain, since one of the resulting sets is pure even if it only contains 1-2% of the full data.
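For reference, here is roughly what my automated cutoff search looks like (just a sketch; the min_fraction guard is only one idea for filtering out those near-boundary cutoffs, not something I have settled on):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cutoff(x, y, min_fraction=0.05):
    """Return the threshold on feature x that maximizes information gain,
    skipping cutoffs that leave fewer than min_fraction of the samples on
    either side (a guard against 'pure but tiny' boundary splits)."""
    parent = entropy(y)
    n = len(y)
    best_gain, best_thr = 0.0, None
    for thr in np.unique(x):
        left, right = y[x <= thr], y[x > thr]
        if len(left) < min_fraction * n or len(right) < min_fraction * n:
            continue
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```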
All in all, I would like to do something to make scikit-learn split on an attribute only once.
If that cannot be done, do you have any advice on how to generate those cutoffs in a nicer way?
Upvotes: 13
Views: 5090
Reputation: 6333
To answer your question briefly: no, there is no built-in parameter to do this in sklearn. I tried to do the same a year ago, so I opened an issue requesting the addition of this feature.
sklearn builds nodes by randomly picking max_features features from the training dataset and searching for the cutoff that reduces the loss function the most. This exact same process is run iteratively until some stopping criterion is met (max_depth, min_samples_leaf, etc.).
Hence, every feature always has the same probability of being picked, regardless of whether or not it has been used before.
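To illustrate the parameters mentioned above (toy data only; note that the same feature index can appear at several nodes of the fitted tree):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = DecisionTreeClassifier(
    max_features=3,        # number of features sampled at each node
    max_depth=4,           # stopping criterion: maximum tree depth
    min_samples_leaf=5,    # stopping criterion: minimum samples per leaf
    random_state=0,
)
clf.fit(X, y)

# The same feature may show up at several nodes of the fitted tree:
print(clf.tree_.feature)   # -2 marks leaves; other values are feature indices
```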
If you're up for it, you can edit the source code of the classifier. In essence, all you need to do is drop the feature that minimizes the loss function after it has been chosen to build a node. That way, the algorithm will be unable to pick that feature again when taking a new sample of max_features features.
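If you want to see the idea outside of sklearn's Cython code, a rough pure-Python sketch could look like this (this is not sklearn's implementation, only an illustration of removing a feature from the candidate set once it has been used):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, features, min_samples_leaf=5):
    """Greedy splitter that uses each feature at most once per path.
    Assumes integer class labels (e.g. 0/1)."""
    if len(features) == 0 or len(np.unique(y)) == 1 or len(y) < 2 * min_samples_leaf:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    best = None
    for f in features:
        for thr in np.unique(X[:, f]):
            mask = X[:, f] <= thr
            if mask.sum() < min_samples_leaf or (~mask).sum() < min_samples_leaf:
                continue
            loss = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
            if best is None or loss < best[0]:
                best = (loss, f, thr, mask)

    if best is None:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    _, f, thr, mask = best
    remaining = [g for g in features if g != f]   # drop the chosen feature
    return {
        "leaf": False,
        "feature": f,
        "threshold": thr,
        "left": build_tree(X[mask], y[mask], remaining, min_samples_leaf),
        "right": build_tree(X[~mask], y[~mask], remaining, min_samples_leaf),
    }
```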
Upvotes: 5
Reputation: 581
I am not going to give a method to directly stop the classifier from using a feature multiple times (although you could do it by defining your own splitter and wiring it in, it is a lot of work).
I would suggest making sure that you are balancing your classes in the first place; take a look at the class_weight parameter for details. That should help a lot with your issue. If that does not work, you can still enforce that no leaf ends up with too little weight by using the min_weight_fraction_leaf or similar parameters, as suggested by maxymoo.
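For example (the 0.05 value is only illustrative, tune it to your data):

```python
from sklearn.tree import DecisionTreeClassifier

# 'balanced' reweights classes by their inverse frequency;
# min_weight_fraction_leaf keeps any leaf from holding less than 5%
# of the total sample weight.
clf = DecisionTreeClassifier(
    class_weight="balanced",
    min_weight_fraction_leaf=0.05,
)
# clf.fit(X, y)  # X, y being your feature matrix and binary labels
```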
Upvotes: 0