Reputation: 81
I'm currently working on a model to predict a probability of fatality once a person is infected with the Corona virus. I'm using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age-group etc. It was suggested to use a decision tree, which I've already built. Since I'm new to decision trees I would like some assistance. I would like to have the prediction (target variable) expressed in a probability (%), not in a binary output. How can I achieve this? Also I want to play around with samples by inputting the data myself and see what the outcome is. For instance: let's take someone who is 40, male etc. and calculate what its survival chance is. How can I achieve this? I've attached the code below:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)
Upvotes: 1
Views: 2370
Reputation: 2478
you can use the method "predict_proba" of the DecisionTreeClassifier to compute the probabilities instead of the binary classification values.
In order to test individual data, that you can create by hand, you have to create an array of the shape of your X_test data (just that it only has one entry). Then you can use that with model.predict(array) or model.predict_proba(array).
By the way, your tree is currently not useful for retrieving probabilities. There is an article that explains the problem very well: https://web.archive.org/web/20210507022823/https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html
So you can fix your code by defining the max_depths of your tree:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
model.fit(X_train, Y_train)
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = list(YHat)
print(df)
Upvotes: 0
Reputation: 253
Use the function called predict_proba
model.predict_proba(X_test)
To the second part of your question, here is what you will have to do. Create your own custom dataset with the exact same column names as you had trained. Read your data from a csv and apply the same encoder values if any.
You can also save your label encoder object in a much more efficient way.
label = preprocessing.LabelEncoder()
label_encoded_columns=['Date_statistics_type', 'Agegroup', 'Sex', 'Province', 'Hospital_admission', 'Municipal_health_service', 'Deceased']
for col in label_encoded_columns:
dataframe[col] = dataframe[col].astype(str)
Label_Encoder = labelencoder.fit(dataframe[label_encoded_columns].values.flatten())
Encoded_Array = (Label_Encoder.transform(dataframe[label_encoded_columns].values.flatten())).reshape(dataframe[label_encoded_columns].shape)
LE_Dataframe=pd.DataFrame(Encoded_DataFrame,columns=label_encoded_columns,index=dataframe.index)
LE_mapping = dict(zip(Label_Encoder.classes_,Label_Encoder.transform(Label_Encoder.classes_).tolist()))
#####This should give you dictionary in the form for all your list of values.
##### for eg: {'Apple':0,'Banana':1}
For your second part of the question, there can be two ways. The first one is pretty straightforward, where in you can use values of X_test to give you a resulting prediction. model.predict(X_test.iloc[0:30]) ###First 30 rows model.predict_proba(X_test.iloc[0:30])
In the second one, if you are talking about introducing new data, then in that case, you will have to label encode the raw data once again.
If that data is not present, it may give you never seen before values error.
Refer to this link in that case
Upvotes: 0
Reputation: 87
Decision Tree can also estimate the probability than an instance belongs to a particular class. Use predict_proba() as below with your train feature data to return the probability of various class you want to predict. model.predict() returns the class which has the highest probability
model.predict_proba()
Upvotes: 0