scc

Reputation: 10716

Trouble understanding output from scikit random forest

Say I have a dataset like this:

5.9;0.645;0.12;2;0.075;32;44;0.99547;3.57;0.71;10.2;5
6;0.31;0.47;3.6;0.067;18;42;0.99549;3.39;0.66;11;6

where the first 11 columns are features (acidity, chlorides, etc.) and the last column is the rating given to the item (e.g. 5 or 6).

The model is trained on the dataset like this:

from sklearn.ensemble import RandomForestClassifier

target = [x[11] for x in dataset]   # last column: the rating
train = [x[0:11] for x in dataset]  # first 11 columns: the features

rf = RandomForestClassifier(n_estimators=120, n_jobs=-1)
rf.fit(train, target)

predictions = rf.predict_proba(testdataset)
print predictions[0]

which prints something like

[ 0.          0.01666667  0.98333333  0.          0.          0.        ]

Now, why does it not output a single classification, e.g. a 5 or a 6 rating?

The documentation says "The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest", which I'm having trouble understanding.

If you use

print rf.predict(testdataset[-1])
[ 6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]

It prints something more like what you'd expect - at least it looks like ratings - but I still don't understand why there seems to be one prediction per feature rather than a single prediction that takes all the features into account.

Upvotes: 5

Views: 5277

Answers (2)

ogrisel

Reputation: 40169

In addition to Diego's answer:

RandomForestClassifier is a classifier: it predicts class assignment over a discrete set of classes, with no ordering between the class labels.

If you want to output a continuous, floating-point rating, you should try a regression model such as RandomForestRegressor instead.

You might have to clamp the output to the range [0, 6], as there is no guarantee that the model will not output a prediction such as 6.2.
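
A rough sketch of that approach, reusing the train, target and testdataset variables from the question and clamping with numpy as suggested above:

from sklearn.ensemble import RandomForestRegressor
import numpy as np

# treat the rating as a continuous target instead of a class label
reg = RandomForestRegressor(n_estimators=120, n_jobs=-1)
reg.fit(train, target)

raw = reg.predict(testdataset)  # floating point ratings, e.g. 5.87
clipped = np.clip(raw, 0, 6)    # clamp to the [0, 6] range
print(clipped[0])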

Edit: to answer your second point, the predict method expects a list of samples, so in your case you should pass it a list containing a single sample. Try:

print rf.predict([testdataset[-1]])

or alternatively:

print rf.predict(testdataset[-1:])

I wonder why you don't get an error in that case.
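
Either way, predict returns an array with one entry per sample (not one per feature), so for a single sample you would just take the first element. A small sketch with the question's variable names:

pred = rf.predict([testdataset[-1]])  # a list containing one 11-feature sample
print(pred)     # something like [ 6.]: one prediction for that one sample
print(pred[0])  # the single predicted rating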

Edit: the output does not really make sense: what are the shapes of your datasets?

>>> print np.asarray(train).shape

>>> print np.asarray(target).shape

>>> print np.asarray(testdataset).shape

Upvotes: 9

Diego

Reputation: 18379

From the docs, predict_proba returns:

p : array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. Classes are ordered by arithmetical order.

The key here is the last phrase, "Classes are ordered by arithmetical order". My guess is that some of your training samples have a class lower than 5, to which predict_proba assigned a probability of zero; classes 5 and 6 got probabilities 0.01666667 and 0.98333333 respectively, and the remaining three classes, all greater than 6, also got probability zero.
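
If that guess is right, you can verify it and map the probabilities back to a rating yourself, since the columns of predict_proba follow rf.classes_. A small sketch, assuming rf and testdataset are the fitted forest and test data from the question (the class labels shown are only an example):

import numpy as np

print(rf.classes_)  # e.g. [ 3.  4.  5.  6.  7.  8.], the sorted class labels
proba = rf.predict_proba(testdataset)  # shape (n_samples, n_classes)

# pick the column with the highest probability and translate it back to its label
best = rf.classes_[np.argmax(proba, axis=1)]
print(best[0])  # should match rf.predict(testdataset)[0]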

Upvotes: 3
