Reputation: 183
I am trying to train a decision tree using scikit. One of my feature sets looks something like this:
X = [[0, 0], [1, 1]]
Y = ['a','b'] #class labels
The other feature set looks like this:
Z = [[1,2,0.5],[2,1,0.5],[0.5,2,2]]
Y = ['a','b','a'] #class labels
I know that if I have only one feature set, I can do this:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
To take the Z feature set into account as well, can I just call fit again:
clf = clf.fit(Z, Y)
Or will this just overwrite my fit for X?
The number of samples in X and Z are different, which is why I can't just zip them together.
Upvotes: 0
Views: 179
Reputation: 4273
Calling .fit() a second time will overwrite everything the classifier learned from the first call; it does not update or extend the existing model.
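As a quick illustration (a minimal sketch using the data from the question; n_features_in_ assumes a reasonably recent scikit-learn), refitting on Z replaces the tree learned from X entirely, so the classifier afterward only knows about Z's three features:
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1]]
Z = [[1, 2, 0.5], [2, 1, 0.5], [0.5, 2, 2]]

clf = DecisionTreeClassifier().fit(X, ['a', 'b'])
clf = clf.fit(Z, ['a', 'b', 'a'])   # replaces the previous fit on X

print(clf.n_features_in_)           # 3 -- only the Z fit remains
# clf.predict([[0, 0]]) would now raise an error: wrong number of features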
One possibility here would be to train two classifiers on the independent feature sets and labels, and combine them to "vote" on the final output.
For example, initialize two classifiers with similar labels but different feature sets:
from sklearn.tree import DecisionTreeClassifier
X = [[0, 0], [1, 1]]
Y1 = ['a','b']
Z = [[1,2,0.5],[2,1,0.5],[0.5,2,2]]
Y2 = ['a','b','a']
clf1 = DecisionTreeClassifier().fit(X, Y1)
clf2 = DecisionTreeClassifier().fit(Z, Y2)
Now if you have new test examples:
X_new = [[0.5, 0.5], [1, 1]]
Z_new = [[1, 1, 1.8], [2.1, 1.5, 0.8]]
y1_pred = clf1.predict(X_new)
y2_pred = clf2.predict(Z_new)
# y1_pred = ['a' 'b'], y2_pred = ['a' 'b']
The easiest way to "vote" would be to take the mode of independent decisions:
from scipy import stats
import numpy as np
y_pred = np.vstack([y1_pred, y2_pred])
print(stats.mode(y_pred)[0])
# [['a' 'b']]
Since the mode is computed over only two predictions here, ties are possible; how to break them is domain-specific.
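For example, one simple tie-breaking rule (just a sketch, not the only option) is to let whichever tree is more confident in its own prediction win, using predict_proba:
import numpy as np

# Per-sample confidence of each tree in its own prediction
p1 = clf1.predict_proba(X_new).max(axis=1)
p2 = clf2.predict_proba(Z_new).max(axis=1)

# Where they disagree, keep the prediction of the more confident tree
y_pred = np.where(p1 >= p2, y1_pred, y2_pred)
print(y_pred)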
A similar setup appears in semi-supervised machine learning under the name "co-training", which also relies on independent feature sets; that literature may offer further insight into this problem.
Upvotes: 1