Demetri Pananos

Reputation: 7404

Probability and Machine Learning

I am using Python to do a bit of machine learning.

I have a NumPy ndarray with 2000 entries. Each entry describes one subject and ends with a boolean that tells me whether the subject is a vampire.

Each entry in the array looks like this:

[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]

My goal is to be able to give a probability that a new subject is a vampire given the data shown above for the subject.

I have used sklearn to do some machine learning for me:

from sklearn import tree

clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, Y)

print(clf.predict(W))

Where W is an array of data for the new subject. The script I have written returns booleans, but I would like it to return probabilities. How can I modify it?

Upvotes: 5

Views: 394

Answers (4)

mathieujofis

Reputation: 349

You're using a regressor but you probably want to use a classifier.

You'll also want a classifier that can give you posterior probabilities, such as a decision tree or logistic regression. Other classifiers may only give you a score (some kind of confidence measure), which may also work for your needs.
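To make the difference between a probability and a score concrete, here is a minimal sketch using logistic regression. The data is synthetic and the labelling rule is made up purely so the example runs; only the shapes (six features per subject, a 0/1 label) mirror the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the vampire data: 200 subjects, 6 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Hypothetical labelling rule, used only to make the example runnable.
Y = (X[:, 2] + X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, Y)

W = rng.normal(size=(3, 6))        # three new subjects
proba = clf.predict_proba(W)       # shape (3, 2); columns follow clf.classes_
scores = clf.decision_function(W)  # raw score (log-odds for this model), not a probability

print(clf.classes_)
print(proba[:, 1])  # estimated probability of class 1 (vampire) per subject
print(scores)
```

Each row of `proba` sums to 1, while `scores` can be any real number; a positive score corresponds to a predicted class of 1.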

Upvotes: 0

codeslord

Reputation: 2368

If you are using DecisionTreeRegressor() then you may use the score function to determine the coefficient of determination R^2 of the prediction.

See the documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

You can also list the cross-validation scores (one per fold, here 10 folds) as below:

from sklearn import tree
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, Y)

# R^2 score on each of the 10 cross-validation folds
print(cross_val_score(clf, X, Y, cv=10))

print(clf.predict(W))

The cross-validation call gives output similar to this:

array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
        0.07..., 0.29..., 0.33..., -1.42..., -1.77...])

Upvotes: 3

wendykan

Reputation: 1

You want to use a classifier that gives you a probability. Also, make sure the data points in your test array W are not replicates of any of your training data. If a point matches a training example exactly, the model treats it as definitely a vampire or definitely not a vampire, so it will give you 0 or 1.
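The 0-or-1 behaviour can be reproduced with a small sketch. With default settings, a scikit-learn decision tree keeps splitting until its leaves are pure, so its predicted probabilities on the training points are all exactly 0 or 1. Limiting how far the tree grows is one way to get fractional values. The data below is synthetic, and `min_samples_leaf=25` is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in for the vampire data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
Y = (X[:, 2] + X[:, 3] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# Default tree: grown until every leaf contains a single class.
full = DecisionTreeClassifier(random_state=0).fit(X, Y)
p_full = full.predict_proba(X[:5])[:, 1]

# Restricted tree: each leaf must hold at least 25 training samples,
# so a leaf can contain a mix of vampires and non-vampires.
limited = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X, Y)
p_limited = limited.predict_proba(X[:5])[:, 1]

print(p_full)     # only 0.0 and 1.0 on training points
print(p_limited)  # leaf class fractions, which can lie between 0 and 1
```

In the restricted tree, the probability reported for a subject is the fraction of training samples of each class in the leaf that the subject falls into.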

Upvotes: 0

BrenBarn

Reputation: 251408

Use a DecisionTreeClassifier instead of a regressor, and call its predict_proba method. Alternatively, you could use logistic regression (also available in scikit-learn).

The basic idea is this:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# One row per subject in W; columns follow the order of clf.classes_
print(clf.predict_proba(W))
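For a version that runs end to end, here is a sketch that builds an array laid out like the one in the question (six feature columns, then the IS_VAMPIRE flag), splits it into X and Y, and reads off the vampire probability for one new subject. The data is random, the rule that generates the labels is invented for the example, and `max_depth=4` is an assumed setting chosen only so that the leaves are not all pure.

```python
import numpy as np
from sklearn import tree

# Synthetic array shaped like the question's data: 2000 rows, 7 columns.
rng = np.random.default_rng(42)
data = rng.normal(size=(2000, 7))
# Hypothetical rule for the last column (IS_VAMPIRE), for illustration only.
data[:, 6] = (data[:, 3] + data[:, 4] > 0)

X = data[:, :-1]              # the six feature columns
Y = data[:, -1].astype(bool)  # the boolean label column

clf = tree.DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, Y)

# A single new subject must still be a 2-D array of shape (1, 6).
W = rng.normal(size=(1, 6))
proba = clf.predict_proba(W)

# Look up which column of proba corresponds to the class True.
vampire_col = list(clf.classes_).index(True)
p_vampire = proba[0, vampire_col]
print(p_vampire)
```

Looking up the column through `clf.classes_` avoids assuming the order in which the classes appear in the output.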

Upvotes: 2
