ASH

Reputation: 20302

Very low Regression scores and ultra low Classification scores

My dataset looks like this (first 20 records). The script that I am testing is below.

Credit_Score  Net_Advance  APR   Mosaic  Time_at_Address  Time_in_Employment  Time_with_Bank  Value_of_Property  Total_Outstanding_Balances  Age
918           3000         14.4  46      132              288                 168             178,000.00         64406                       46
903           21000        7.9   16      288              37                  300             180,000.00         31614                       59
1060          7200         7.9   17      276              154                 369             199,000.00         26045                       56
839           8000         16.9  47      48               82                  216             120,000.00         181217                      33
1057          7650         7.4   55      156              342                 510             180,000.00         63811                       49
913           33000        9.4   59      18               170                 240             205,000.00         219003                      45
840           8000         15.9  12      293              77                  317             179,000.00         90797                       51
961           5300         11.9  43      163              351                 243             92,000.00          84624                       49
901           12000        11.9  11      108              24                  180             180,000.00         158678                      55
915           6000         12.9  49      36               72                  384             120,000.00         2785                        48
840           10150        12.4  24      37               58                  261             110,000.00         109231                      27
968           18000        8.4   24      2                168                 420             120,000.00         85502                       49
904           10000        8.7   46      24               8                   174             150,000.00         157718                      37
924           8000         9.9   47      418              439                 379             120,000.00         2827                        72
896           5000         9.4   15      4                240                 300             246,000.00         257560                      48
804           5000         17.1  44      12               36                  240             165,000.00         160650                      37
840           21200        11.5  44      339              133                 231             117,000.00         31316                       50
862           2000         31.9  18      44               63                  186             291,000.00         279819                      35
785           1100         40.9  23      94               54                  150             120,000.00         789                         39
847           20000        9.4   16      237              309                 326             272,000.00         170348                      59

Here is my actual code.

# Using both Regression and Classification to measure the Credit Score of a customer
import numpy as np
import pandas as pd
from sklearn import datasets
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection# random forest model creation
from sklearn.model_selection import train_test_split# implementing train-test-split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# load data from CSV into data frame and use a specific argument 'thousands=',''
df = pd.read_csv("C:\\my_path\\credit.csv", encoding="ISO-8859-1",sep=',', thousands=',')

# view a small sample of data for peace of mind
df.head()

from sklearn.ensemble import RandomForestClassifier
features = np.array(['Net_Advance', 'APR', 'Mosaic', 'Mosaic_Class', 'Time_at_Address', 'Number_of_Dependants', 'Time_in_Employment', 'Income_Range', 'Time_with_Bank', 'Value_of_Property', 'Outstanding_Mortgage_Bal', 'Total_Outstanding_Balances', 'Age'])
clf = RandomForestClassifier()
clf.fit(df[features], df['Credit_Score'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

[Image: horizontal bar plot "Variable Importance" (x-axis: Relative Importance)]

# try PCA & LDA methodologies
# first PCA ...
X = df[['Net_Advance', 'APR', 'Mosaic', 'Mosaic_Class', 'Time_at_Address', 'Number_of_Dependants', 'Time_in_Employment', 'Income_Range', 'Time_with_Bank', 'Value_of_Property', 'Outstanding_Mortgage_Bal', 'Total_Outstanding_Balances', 'Age']]
y = df[['Credit_Score']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Performance Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))

# Result:
Accuracy 0.009062326613648974

So, my question is, how can the learning be so, so, so low? This is basically a zero learning outcome, based on the code and results displayed above. Also, when I test a few other concepts/experiments, as described below, I see accuracy results around 60%, at best. I would expect around 90+ percent accuracy results... Here is the code that I am testing.

    # Bagging Classifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn import tree
    model = BaggingClassifier(tree.DecisionTreeClassifier(random_state=1))
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 5% accurate.  horrible!


    # Bagging Regressor
    from sklearn.ensemble import BaggingRegressor
    model = BaggingRegressor(tree.DecisionTreeRegressor(random_state=1))
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # almost 65% accurate; better but not great!


    # AdaBoostClassifier
    from sklearn.ensemble import AdaBoostClassifier
    model = AdaBoostClassifier(random_state=1)
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # just 1% accurate!  no way!!


    # AdaBoostRegressor
    from sklearn.ensemble import AdaBoostRegressor
    model = AdaBoostRegressor()
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 60%. just ok.


    # GradientBoostingClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    model= GradientBoostingClassifier(learning_rate=0.1,random_state=1)
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 60%. just ok. 


    # GradientBoostingRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    model= GradientBoostingRegressor()
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 60%. just ok.


    # XGBClassifier
    import xgboost as xgb
    model=xgb.XGBClassifier(random_state=1,learning_rate=0.1)
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 60%. just ok.


    # XGBRegressor
    import xgboost as xgb
    model=xgb.XGBRegressor()
    model.fit(x_train, y_train)
    model.score(x_test,y_test)
    # around 60%. just ok.

Any idea why this, for one thing, is wrong?

# XGBRegressor
import xgboost as xgb
model=xgb.XGBRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)

Upvotes: 1

Views: 430

Answers (1)

desertnaut

Reputation: 60321

There are several issues here, but let me start with a hint:

There is a fundamental reason why many of the models you have tried here come in two "versions", Classifier and Regressor: classification and regression are two different and mutually exclusive types of problems. Only one of these types is applicable to a particular problem, never both.

Whether a problem is a classification or a regression one is determined by the target variable; here, your Credit_Score is a numeric variable, hence you are in a regression setting. As a corollary, all your experiments here with classifier models are meaningless and invalid, and they can be safely thrown away (no wonder they showed such ultra low "accuracy" performance).
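To see why the classifiers struggle, here is a minimal sketch with made-up data (the score range 780 to 1060 is only an assumption loosely based on the 20 rows posted, not your real dataset). Treated as class labels, a numeric score turns into hundreds of "classes" with only a few samples each:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Credit_Score: integer scores in an assumed range.
rng = np.random.default_rng(0)
credit_score = pd.Series(rng.integers(780, 1061, size=1000), name="Credit_Score")

n_samples = len(credit_score)
n_classes = credit_score.nunique()  # what a classifier would treat as labels
samples_per_class = n_samples / n_classes

print(f"{n_samples} samples, {n_classes} distinct target values")
print(f"about {samples_per_class:.1f} samples per 'class' on average")
```

Running `df['Credit_Score'].nunique()` on the real data frame should give the equivalent picture for your case.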

The other issue is your use of "accuracy"; this term has a very specific meaning in ML, which is not exactly the same as its use in everyday life: it is the percentage of samples correctly classified. As this definition already hints, accuracy is applicable only in classification problems, and its use in regression ones (such as yours here) is meaningless.
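A small illustration with made-up numbers (not taken from your data): accuracy only counts exact matches, so predictions that miss by a point or two still count as plain wrong, while a regression metric such as MAE reports how far off they are.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical true scores and predictions that are off by at most 2 points.
y_true = np.array([918, 903, 1060, 839])
y_pred = np.array([917, 903, 1058, 840])

acc = accuracy_score(y_true, y_pred)       # fraction of exact matches only
mae = mean_absolute_error(y_true, y_pred)  # average absolute error, in score points

print(acc)  # 0.25
print(mae)  # 1.0
```

Only one of the four predictions matches exactly, hence 25% "accuracy", even though the average error is a single point.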

Each one of the scikit-learn models you have employed here comes with its own score method, and it is advisable to check the docs to see exactly which score a specific model reports; for good or bad, scikit-learn developers have chosen to use the R^2 (or R-squared) coefficient in their regressor models. Although this is usually (not always) a number in [0, 1], we normally don't use it as a percentage (as you do here), and it is not the "accuracy" of a regression model (again, such a thing does not exist).
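As a quick check, here is a sketch on synthetic data (LinearRegression is used only for brevity; it is not one of the models from the question). It shows that the score of a regressor matches `r2_score` on its predictions, and that R^2 can indeed fall outside [0, 1]:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relationship (not the question's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

# For this regressor, .score() is the R^2 of the predictions.
score = model.score(X, y)
r2 = r2_score(y, model.predict(X))
print(np.isclose(score, r2))

# R^2 is not bounded below by 0: predictions worse than the mean go negative.
r2_bad = r2_score([1, 2, 3], [3, 2, 1])
print(r2_bad)  # -3.0
```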

I have explained elsewhere why the choice of R^2 is an unfortunate one in ML predictive settings; quoting:

the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

with which I certainly concur.

So, you would be better off dropping the native score methods of the regressors used and switching to a metric such as mean squared error (MSE), mean absolute error (MAE), or root mean squared error (RMSE), which are the ones used in practice in predictive ML settings (and are all available in scikit-learn).
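A minimal sketch of what that could look like, using synthetic data from `make_regression` as a stand-in for your credit dataset (the split parameters mirror the question; everything else is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the credit dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=66
)

model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)  # back in the same units as the target

print(f"MSE:  {mse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

MAE and RMSE are in the units of the target, so on your data they would read directly as "credit score points".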

The "downside" of these regression metrics is that by definition they cannot be expressed as a percentage, so you need some extra inspection to judge whether the result is good enough for your case.
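One possible form of such inspection (my suggestion here, and only a sketch on synthetic data) is to compare the error against a trivial baseline that always predicts the training mean, and against the spread of the target itself:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the credit dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=66
)

# Baseline: ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_model = mean_absolute_error(y_test, model.predict(X_test))

print(f"Baseline MAE: {mae_baseline:.2f}")
print(f"Model MAE:    {mae_model:.2f}")
print(f"Target std:   {np.std(y_test):.2f}")
```

A model whose error is not clearly below the baseline has learned little; whether the remaining error is acceptable depends on what a difference of that many score points means in your application.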


A last note regarding the presentation.

In my experience here at SO, it is very strange that something like this (i.e. using classifiers in a regression problem), which happens more frequently than one may think, was not picked up by someone in the community for as long as ~ 7 hours before my answer; my guess is that this was due to the combination of an intimidating amount of code posted, plus a rather unfortunate and arguably bad title (I almost passed on it myself). Just saying...

Upvotes: 3
