eternalmothra

Reputation: 221

Hacking/cloning sklearn to support pruning Decision Trees?

I want to create a decision tree and then prune it in Python. However, sklearn does not support pruning out of the box. Searching the internet, I found this: https://github.com/sgenoud/scikit-learn/blob/4a75a4aaebd45e864e28cfca897121d1199e41d9/sklearn/tree/tree.py

But I don't know how to use the file. I tried:

from sklearn.datasets import load_iris
import tree

clf = tree.DecisionTreeClassifier()
iris = load_iris()

clf = clf.fit(iris.data, iris.target)

But I get the error ValueError: Attempted relative import in non-package. Is that not how I import? Do I need to save the files in a different way? Thank you.

Upvotes: 0

Views: 8125

Answers (3)

yzerman

Reputation: 996

Scikit-learn version 0.22 introduced pruning in DecisionTreeClassifier. A new hyperparameter called ccp_alpha controls the amount of minimal cost-complexity pruning: larger values prune more nodes. See the DecisionTreeClassifier documentation.
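
As a rough sketch of how ccp_alpha might be used (assuming scikit-learn 0.22 or later; the iris data, the random_state values, and taking the middle alpha from the pruning path are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit an unpruned tree first so there is something to compare against.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas along the cost-complexity pruning path (ascending order).
path = unpruned.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Illustrative choice only: take an alpha from the middle of the path.
# The clamp guards against tiny negative values from floating-point error.
alpha = max(0.0, float(ccp_alphas[len(ccp_alphas) // 2]))

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("nodes before pruning:", unpruned.tree_.node_count)
print("nodes after pruning: ", pruned.tree_.node_count)
print("test accuracy (pruned):", pruned.score(X_test, y_test))
```

In practice you would probably choose alpha by cross-validating over ccp_alphas rather than picking one by position.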

Upvotes: 0

smci

Reputation: 33938

If you really want to use sgenoud's 7-year-old fork of scikit-learn from back in 2012, git clone the whole repository from its base directory; don't just copy individual files. (Of course you'll lose every improvement and fix made since 2012; the fork dates back to v0.12.)

But that idea sounds misconceived: you can get shallower, effectively pruned trees through early stopping, using the DecisionTreeClassifier parameters max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease, and min_impurity_split (deprecated, and removed in scikit-learn 1.0). See the docs and play around with the parameters; they do what you're asking for. I've done ML for >10 years and never once seen a need to hack the DT source. There are tons of good reasons not to do this and no good reasons to.
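
A minimal sketch of that early-stopping approach, for comparison against an unconstrained tree. The specific values (max_depth=3, min_samples_leaf=5, min_impurity_decrease=0.01) are arbitrary assumptions chosen for illustration, and get_depth/get_n_leaves assume scikit-learn 0.21 or later:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: growth stops early once any limit is hit.
shallow = DecisionTreeClassifier(
    max_depth=3,                 # never split below depth 3
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
).fit(X, y)

print("full tree:    depth", full.get_depth(), "leaves", full.get_n_leaves())
print("shallow tree: depth", shallow.get_depth(), "leaves", shallow.get_n_leaves())
```

Tuning these values with something like GridSearchCV is usually a better guide than picking them by hand.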

(And if you try to play with the DecisionTreeClassifier parameters and still can't get what you want, post a reproducible code example here using an open-source dataset like iris etc.)

Upvotes: 1

cleros

Reputation: 4333

In Python, modules (roughly what other languages call packages) often define routines that depend on one another. In that case you cannot download just one .py file and put it into your workspace (i.e. the directory where your sources are located). tree.py uses relative imports, which only work when the file is loaded as part of its package, hence the ValueError. Instead, download the entire package into that folder and import through the full package path, like this:

# a wildcard import; use it only if you are absolutely certain there will be no namespace conflicts
from sklearn.tree.tree import *
# a safer way is to import the classes/functions you need explicitly
from sklearn.tree.tree import DecisionTreeClassifier

Upvotes: -1
