Reputation: 417
I'm trying to construct a decision tree with scikit-learn's DecisionTreeClassifier. My data has numeric features consisting of integer and float values.
When constructing the decision tree, the integer features get converted to float.
For example, if A is a feature that can only take integer values from 1-12, splitting criteria such as "A < 5.5" or "A < 3.1" come up in the tree. I don't want float-valued splitting criteria for A.
The depth of the tree increases when integer features get converted to float. How do I prevent integer features from being converted to float?
Also, scikit-learn's DecisionTreeClassifier doesn't allow categorical features. Are there any alternative packages/libraries for constructing decision trees that allow categorical features?
Upvotes: 3
Views: 2550
Reputation: 8270
Regarding integer vs. floating-point values for decision trees: it does not matter for building the tree. Any split threshold placed between the same two consecutive integers partitions the data identically, and the tree will never make two splits between the same pair of consecutive integers, because one of the resulting leaves would have no samples. The model is equivalent regardless of whether integers or floats are used.
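As a quick illustration (a minimal sketch; the single integer feature A and the toy labels below are made up), you can fit a tree on integer data and inspect the learned thresholds: scikit-learn places them at midpoints such as 5.5, which split the integers exactly as "A <= 5" would:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

A = np.arange(1, 13).reshape(-1, 1)   # integer feature with values 1-12
y = (A.ravel() > 5).astype(int)       # toy labels that depend only on whether A > 5

clf = DecisionTreeClassifier().fit(A, y)

# thresholds of the internal (non-leaf) nodes; expect a midpoint such as 5.5
print(clf.tree_.threshold[clf.tree_.feature >= 0])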
With scikit-learn you can handle categorical features by encoding them with a LabelBinarizer. This creates a matrix of dummy values (one-hot encoded) for the categories.
Here is an example:
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier
import numpy as np
Define the features:
month = ['Jan', 'Feb', 'Jan', 'Mar']
day = [1, 15, 30, 5]
Define the category targets:
y = [0, 1, 1, 1]
Build dummies:
lb = LabelBinarizer()
X_month_dummies = lb.fit_transform(month)
X_month_dummies is then:
array([[0, 1, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
Combine the dummies with the numeric feature (day):
X = np.hstack([np.column_stack([day]), X_month_dummies])
Build the classifier:
clf = DecisionTreeClassifier()
clf.fit(X, y)
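To classify new data, apply the same fitted LabelBinarizer before predicting. A minimal sketch, assuming the lb and clf objects from above (the sample values are only illustrative):
new_month = lb.transform(['Feb'])    # reuse the binarizer fitted on the training months
new_day = np.column_stack([[10]])    # day value for the new sample
X_new = np.hstack([new_day, new_month])
print(clf.predict(X_new))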
Upvotes: 4