Dataset: columns 0-9 are float features (parameters of a product); column 10 is an int label (the product).
Goal
Calculate a 0-1 classification certainty score for the labels (this is what my current code should do)
Calculate the same certainty score for each “product_name” (300 columns) at each row (22,000 rows)
ERROR
I use sklearn.tree.DecisionTreeClassifier and am trying to use "predict_proba", but it gives an error.
Python CODE
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input.
# Model n_features is 10 and input n_features is 16722

print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}%'.format(clf.score(X_train, y_train) * 100))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}%'.format(clf.score(X_test, y_test) * 100))
print(class_probabilitiesDec[:10])
# If I use X_train instead, it just prints out a bunch of 41-element vectors:
[[ 0.00490808  0.00765327  0.01123035  0.00332751  0.00665502  0.00357707
   0.05182597  0.03169453  0.04267532  0.02761833  0.01988187  0.01281091
   0.02936528  0.03934781  0.02329257  0.02961484  0.0353548   0.02503951
   0.03577073  0.04700108  0.07661592  0.04433907  0.03019715  0.02196157
   0.0108976   0.0074869   0.0291989   0.03951418  0.01372598  0.0176358
   0.02345895  0.0169703   0.02487314  0.01813493  0.0482489   0.01988187
   0.03252641  0.01572249  0.01455786  0.00457533  0.00083188]
 [....
FEATURES (COLUMNS)
(first five rows; the last column is the label)

   0  1  2    3           4           5  6  7    8    9  label
0  1  1  1  1.0  1462293561  1462293561  0  0  0.0  0.0      1
1  2  2  2  8.0  1460211580  1461091152  1  1  0.0  0.0      2
2  3  3  3  1.0  1469869039  1470560880  1  1  0.0  0.0      3
3  4  4  4  1.0  1461482675  1461482675  0  0  0.0  0.0      4
4  5  5  5  5.0  1462173043  1462386863  1  1  0.0  0.0      5
CLASSES COLUMNS (300 COLUMNS OF ITEMS)
HEADER ROW:  apple  gameboy  battery  ....
1st row:     0.763    0.346    0.345  ....
2nd row:     0.256    0.732    0.935  ....
Example: similar scores are used when someone classifies images as cat vs. dog and the classifier gives confidence scores.
ANSWER
You cannot pass your labels to predict_proba.
predict_proba predicts the probability of each label from your X data, thus:
class_probabilitiesDec = clf.predict_proba(X_test)
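This gives you one row per sample in X_test and one column per class. A quick illustrative check (a sketch, reusing the names from the code above; the prints are my addition):

print(class_probabilitiesDec.shape)   # (number of rows in X_test, number of classes)
print(clf.classes_)                   # the label each probability column refers to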
What you posted as the output "when I use X_train" is exactly this: each of those 41-element vectors is the list of probabilities, one for each of your 41 possible labels, for one row of X_train.
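If you want to convince yourself, here is a small sanity-check sketch (assuming numpy and the clf/X_test from above): every row sums to 1, and picking the most probable column reproduces predict:

import numpy as np

proba = clf.predict_proba(X_test)
print(np.allclose(proba.sum(axis=1), 1.0))   # True: each row is a probability distribution
print((clf.classes_[proba.argmax(axis=1)] == clf.predict(X_test)).all())   # True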
EDIT
After reading your comments, predict_proba is exactly what you want.
Let's make an example. In the following code we have a classifier with 3 classes: either 11, 12 or 13.
If the input is 1, the classifier should predict 11.
If the input is 2, the classifier should predict 12.
...
If the input is 7, the classifier should predict 13.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
# X: a single feature 1..7; y: the class for each input (1-D, as sklearn expects)
clf.fit([[1], [2], [3], [4], [5], [6], [7]], [11, 12, 13, 13, 12, 11, 13])
Now, if your test data has a single row, e.g. 5, then the classifier should predict 12. So let's try that:
clf.predict([[5]])
And voilà: the result is array([12]).
If we want a probability, then predict_proba is the way to go:
clf.predict_proba([[5]])
and we get array([[0., 1., 0.]])
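The order of those columns follows clf.classes_, which you can inspect directly:

print(clf.classes_)   # array([11, 12, 13])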
In that case, the array [0., 1., 0.] means:
0% probability for class 11
100% probability for class 12
0% probability for class 13
If I'm correct, that's exactly what you want. You can even map that to the names of your classes with:
probabilities = clf.predict_proba([[5]])[0]
{clf.classes_[i] : probabilities[i] for i in range(len(probabilities))}
which gives you a dictionary mapping class names to probabilities:
{11: 0.0, 12: 1.0, 13: 0.0}
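An equivalent, slightly shorter way to build that mapping is dict(zip(clf.classes_, probabilities)).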
Now, in your case you have far more classes than just [11, 12, 13], so the array gets longer. And for every row in your dataset predict_proba creates such an array, so for more than a single row of data your output becomes a matrix.
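That matrix is exactly your per-product score table. A sketch of how you could label it with your class names (assuming clf is the tree fitted on your real data, X_scaled is your scaled feature matrix, and your label column holds the product names; if it holds ids, map them to names first):

import pandas as pd

proba = clf.predict_proba(X_scaled)                  # shape: (n_rows, n_classes)
scores = pd.DataFrame(proba, columns=clf.classes_)   # e.g. columns: apple, gameboy, battery, ...
print(scores.head())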