yeetfan

Reputation: 11

My RandomForest keeps returning the exact same probabilities for model.predict_proba() regardless of input

The code is supposed to predict likelihood of diabetes given parameters like glucose, blood pressure, BMI and age:

I first had to trim out the columns I didn't need:

df = pd.read_csv('diabetes.csv')
keep_col = ['Glucose', 'BloodPressure', 'BMI', 'Age', 'Outcome']
df = df[keep_col]
df.to_csv('newFile.csv', index=False)

Then I evened out the data set, because there were twice as many patients who did not have diabetes:

shuffled_df = df.sample(frac=1, random_state=4)
diabetes_df = shuffled_df.loc[shuffled_df['Outcome'] == 1]
no_diabetes_df = shuffled_df.loc[shuffled_df['Outcome'] == 0].sample(n=684, random_state=42)
df = pd.concat([diabetes_df, no_diabetes_df])

Making the training and testing sets:

X = df.iloc[:,:-1].values
Y = df.iloc[:,-1].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.25)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Training the model:

model = RandomForestClassifier(n_estimators=1, criterion='entropy', random_state=1)
model.fit(X_train, Y_train)

Checking accuracy on the training set; this usually returns anywhere from around 0.95 to 1:

model.score(X_train, Y_train)

Printing number of true negatives, true positives, false negatives, and false positives:

cm = confusion_matrix(Y_test, model.predict(X_test))

TN = cm[0][0]
TP = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]

print(cm)

print('Model Test Accuracy = {}'.format((TP + TN) / (TP + TN + FN + FP)))

The test accuracy is usually above 80%.

Finally, when I go use the model to make a new prediction such as:

model.predict_proba([[140,77,25,30]])

It always returns the same value, such as array([[0.3, 0.6]]), even when I switch glucose from 140 to 190, or BMI from 25 to 30, etc. The only time the probabilities change is when I change the number of estimators, but even then they don't vary with different inputs.

Any help with this problem would be much appreciated!

Upvotes: 1

Views: 970

Answers (1)

Markus Eyting

Reputation: 39

As commented, a random forest typically consists of many trees (n_estimators), so you should raise this to e.g. 100. With so few variables it is not unlikely that you get the same probabilities for different X, as they might all end up in the same leaf. Have you tried changing the X values more drastically, e.g. setting them all to 0, or to very small vs. very high values? And what does model.predict_proba(X_test) return? Also, I don't know how deep your tree is, so you might have to increase max_depth too in order to get more heterogeneity.
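To illustrate the point, here is a minimal sketch on synthetic data (the features and the decision rule are made up to stand in for your glucose/BP/BMI/age columns, since I don't have your diabetes.csv): a single tree outputs coarse leaf frequencies, while 100 trees average their votes and respond to the input.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for (glucose, blood pressure, BMI, age).
rng = np.random.default_rng(0)
X = rng.uniform([70, 50, 15, 20], [200, 110, 45, 80], size=(1000, 4))
# Outcome loosely tied to glucose and BMI so the forest has signal to learn.
y = ((X[:, 0] > 140) | (X[:, 2] > 35)).astype(int)

# One tree (your original setting) gives coarse, often identical probabilities.
one_tree = RandomForestClassifier(n_estimators=1, random_state=1).fit(X, y)
# Many trees average their votes, so the probabilities move with the input.
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

low = forest.predict_proba([[100, 77, 25, 30]])[0]
high = forest.predict_proba([[190, 77, 25, 30]])[0]
print(one_tree.predict_proba([[100, 77, 25, 30]]), low, high)
```

With the larger forest, the predicted probability of class 1 rises clearly when glucose goes from 100 to 190, which is the behaviour you were expecting.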

Just FYI, here is a nice guide to tuning the random forest's hyperparameters: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
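The approach in that guide can be sketched with scikit-learn's RandomizedSearchCV; the data and the parameter ranges below are illustrative assumptions, not values taken from the guide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic data set, just to make the search runnable.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = (X[:, 0] + X[:, 2] > 1).astype(int)

# A few hyperparameters worth searching over (ranges are arbitrary examples).
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Randomized search samples n_iter combinations and cross-validates each.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_dist, n_iter=5, cv=3, random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```

search.best_params_ then holds the best combination found, which you would plug into the final model.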

Upvotes: 1
