Jonas Apelt

Reputation: 41

Python AdaBoost: all predictions are the same class

### my dataset
import pandas as pd
csv_url = 'https://raw.githubusercontent.com/ga59wig/419B/main/data.csv?token=GHSAT0AAAAAACKSAONPXCVHO2L4IGQCID72ZK3422Q'
gdf = pd.read_csv(csv_url)

gdf = gdf.dropna()

# for accuracy_score
from sklearn import metrics

# for train_test_split
from sklearn import model_selection

# for the classifier
from sklearn import ensemble

cols = gdf.columns
cols = cols[1:]

x = gdf[cols].values
y = gdf["Relative Height bin98 (cm)"]

print(y)
print(x)

x_train, x_test, y_train, y_test = model_selection.train_test_split(x,y, test_size=0.2, random_state=1569)

adaBoost_model = ensemble.AdaBoostClassifier(n_estimators=200, learning_rate=1e-05)
adaBoost_model.fit(x_train, y_train)
adaBoost_prediction = adaBoost_model.predict(x_test)

adaBoost_accuracy = metrics.accuracy_score(adaBoost_prediction, y_test)
adaBoost_confusion_matrix = metrics.confusion_matrix(adaBoost_prediction, y_test)
adaBoost_classification_report = metrics.classification_report(adaBoost_prediction, y_test)

print("Accuracy:", adaBoost_accuracy)
print(adaBoost_confusion_matrix)
print(adaBoost_classification_report)

OUTPUT:

Accuracy: 0.4168228190714137
[[8008 3144 4233 3827]
 [   0    0    0    0]
 [   0    0    0    0]
 [   0    0    0    0]]
              precision    recall  f1-score   support

     3 - 6 m       1.00      0.42      0.59     19212
    6 - 10 m       0.00      0.00      0.00         0
        <3 m       0.00      0.00      0.00         0
      > 10 m       0.00      0.00      0.00         0

    accuracy                           0.42     19212
   macro avg       0.25      0.10      0.15     19212
weighted avg       1.00      0.42      0.59     19212

I'm new to ML in Python and am having trouble finding the error in my data or my code. The model predicts only one class, and the accuracy is always the same (because of that, I assume). What am I doing wrong?

I tried already:

Upvotes: 1

Views: 81

Answers (1)

MuhammedYunus

Reputation: 5095

The data looks a bit suspect to me. A histogram plot indicates that even though the dataset has about 100k rows, most of them are duplicates:


import matplotlib.pyplot as plt

gdf.hist(grid=False, bins=50)
plt.tight_layout()

This is confirmed by counting the unique rows per label. In the resulting table, even though about 41k rows (almost half of the dataset) fall under the y=3-6m label, you can see that they are duplicates of only 4 unique rows. To me it seems like something has gone wrong in an earlier processing step for the data, and I think that would be the place to start looking.


gdf.groupby('Relative Height bin98 (cm)').value_counts().to_frame()
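If you want to put a number on the duplication, something along these lines might help. It is only a minimal sketch on a made-up frame: `toy`, `label`, `feature_a` and `feature_b` are placeholder names I invented for illustration, not columns from your data, so swap in `gdf` and your real label column.

```python
import pandas as pd

# Hypothetical stand-in for gdf: a few unique rows repeated several times.
toy = pd.DataFrame({
    "label": ["3 - 6 m"] * 6 + ["<3 m"] * 4,
    "feature_a": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0],
    "feature_b": [0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.1, 0.1, 0.1, 0.1],
})

n_rows = len(toy)

# duplicated() marks every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Number of distinct rows left per label once exact duplicates are dropped.
unique_per_label = toy.drop_duplicates().groupby("label").size()

print(f"{n_rows} rows, {n_duplicates} exact duplicates")
print(unique_per_label)
```

If the number of distinct rows per label turns out to be tiny compared with the row count, as it seems to be here, the classifier has very little real information to learn from, whatever its settings.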

Upvotes: 1
