scc

Reputation: 10716

Trouble understanding output from scikit random forest

Say I have a dataset like this:

5.9;0.645;0.12;2;0.075;32;44;0.99547;3.57;0.71;10.2;5
6;0.31;0.47;3.6;0.067;18;42;0.99549;3.39;0.66;11;6

where the first 11 columns are features (acidity, chlorides, etc.) and the last column is the rating given to the item (e.g. 5 or 6).

The model is trained on the dataset like this:

from sklearn.ensemble import RandomForestClassifier

target = [x[11] for x in dataset]   # last column: the rating
train = [x[0:11] for x in dataset]  # first 11 columns: the features

rf = RandomForestClassifier(n_estimators=120, n_jobs=-1)
rf.fit(train, target)

predictions = rf.predict_proba(testdataset)
print predictions[0]

which prints something like

[ 0.          0.01666667  0.98333333  0.          0.          0.        ]

Now, why does it not output a single classification, e.g. a 5 or a 6 rating?

The documentation says "The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest", which I'm having trouble understanding.

If you use

print rf.predict(testdataset[-1])
[ 6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]

It prints something more like what you'd expect - at least it looks like ratings - but I still don't understand why there seems to be one prediction per feature rather than a single prediction that takes all the features into account.

Upvotes: 5

Views: 5277

Answers (2)

ogrisel

Reputation: 40169

In addition to Diego's answer:

RandomForestClassifier is a classifier: it predicts class assignment over a discrete set of classes, with no ordering between the class labels.

If you want to output a continuous, floating-point rating, you should try a regression model such as RandomForestRegressor instead.

You might have to clamp the output to the range [0, 6], as there is no guarantee that the model will not output a prediction such as 6.2.
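
A rough sketch of that approach, reusing the train, target and testdataset variables from the question and clamping with numpy as suggested above:

from sklearn.ensemble import RandomForestRegressor
import numpy as np

# treat the rating as a continuous target instead of a class label
reg = RandomForestRegressor(n_estimators=120, n_jobs=-1)
reg.fit(train, target)

raw = reg.predict(testdataset)  # floating point ratings, e.g. 5.87
clipped = np.clip(raw, 0, 6)    # clamp to the [0, 6] range
print(clipped[0])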

Edit: to answer your second point, the predict method expects a list of samples, so in your case you should pass it a list containing a single sample. Try:

print rf.predict([testdataset[-1]])

or alternatively:

print rf.predict(testdataset[-1:])

I wonder why you don't get an error in that case.
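
Either way, predict returns an array with one entry per sample (not one per feature), so for a single sample you would just take the first element. A small sketch with the question's variable names:

pred = rf.predict([testdataset[-1]])  # a list containing one 11-feature sample
print(pred)     # something like [ 6.]: one prediction for that one sample
print(pred[0])  # the single predicted rating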

Edit: the output does not really make sense: what are the shapes of your datasets?

>>> print np.asarray(train).shape

>>> print np.asarray(target).shape

>>> print np.asarray(testdataset).shape

Upvotes: 9

Diego

Reputation: 18379

From the docs, predict_proba returns:

p : array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. Classes are ordered by arithmetical order.

The key here is the last phrase, "Classes are ordered by arithmetical order". My guess is that some of your training samples have a class lower than 5, to which predict_proba assigned a probability of zero; classes 5 and 6 got probabilities 0.01666667 and 0.98333333 respectively, and the remaining three classes, all greater than 6, also got probability zero.
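
If that guess is right, you can verify it and map the probabilities back to a rating yourself, since the columns of predict_proba follow rf.classes_. A small sketch, assuming rf and testdataset are the fitted forest and test data from the question (the class labels shown are only an example):

import numpy as np

print(rf.classes_)  # e.g. [ 3.  4.  5.  6.  7.  8.], the sorted class labels
proba = rf.predict_proba(testdataset)  # shape (n_samples, n_classes)

# pick the column with the highest probability and translate it back to its label
best = rf.classes_[np.argmax(proba, axis=1)]
print(best[0])  # should match rf.predict(testdataset)[0]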

Upvotes: 3
