Reputation: 41
### my dataset
```python
import pandas as pd

# for accuracy_score, confusion_matrix, classification_report
from sklearn import metrics
# for train_test_split
from sklearn import model_selection
# for the AdaBoost classifier
from sklearn import ensemble

csv_url = 'https://raw.githubusercontent.com/ga59wig/419B/main/data.csv?token=GHSAT0AAAAAACKSAONPXCVHO2L4IGQCID72ZK3422Q'
gdf = pd.read_csv(csv_url)
gdf = gdf.dropna()

# use every column except the first as features
cols = gdf.columns[1:]
x = gdf[cols].values
y = gdf["Relative Height bin98 (cm)"]
print(y)
print(x)

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=1569)

adaBoost_model = ensemble.AdaBoostClassifier(n_estimators=200, learning_rate=1e-05)
adaBoost_model.fit(x_train, y_train)
adaBoost_prediction = adaBoost_model.predict(x_test)

# note: these calls pass (y_pred, y_true), but sklearn's metrics expect (y_true, y_pred)
adaBoost_accuracy = metrics.accuracy_score(adaBoost_prediction, y_test)
adaBoost_confusion_matrix = metrics.confusion_matrix(adaBoost_prediction, y_test)
adaBoost_classification_report = metrics.classification_report(adaBoost_prediction, y_test)

print("Accuracy:", adaBoost_accuracy)
print(adaBoost_confusion_matrix)
print(adaBoost_classification_report)
```
OUTPUT:

```
Accuracy: 0.4168228190714137
[[8008 3144 4233 3827]
 [   0    0    0    0]
 [   0    0    0    0]
 [   0    0    0    0]]
              precision    recall  f1-score   support

     3 - 6 m       1.00      0.42      0.59     19212
    6 - 10 m       0.00      0.00      0.00         0
        <3 m       0.00      0.00      0.00         0
      > 10 m       0.00      0.00      0.00         0

    accuracy                           0.42     19212
   macro avg       0.25      0.10      0.15     19212
weighted avg       1.00      0.42      0.59     19212
```
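(As an aside on the confusion matrix above: scikit-learn's metric functions take their arguments in `(y_true, y_pred)` order. `accuracy_score` happens to be symmetric, but `confusion_matrix` is transposed and `classification_report`'s support column is mislabeled when the arguments are swapped. A minimal sketch with made-up labels:)

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # a degenerate model that predicts a single class

# correct order: rows are true labels, columns are predicted labels
print(metrics.confusion_matrix(y_true, y_pred))  # [[2 0], [2 0]]

# swapped order transposes the matrix
print(metrics.confusion_matrix(y_pred, y_true))  # [[2 2], [0 0]]
```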
I'm new to ML in Python and am having trouble finding the error in my data or my code: my predictions contain only one class, and the accuracy is always the same (I assume because of that). What am I doing wrong?

I have already tried:

- setting `learning_rate` to a low value
- converting `y` to integer
- checking my data
- running the same code with the 'iris' example data, where it works
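(A quick way to confirm the single-class behavior is to count the predicted labels with `numpy.unique`; the array below is a made-up stand-in for `adaBoost_prediction` from the code above:)

```python
import numpy as np

# hypothetical prediction array from a degenerate model
adaBoost_prediction = np.array(['3 - 6 m'] * 19212)

labels, counts = np.unique(adaBoost_prediction, return_counts=True)
print(dict(zip(labels, counts)))  # a single entry means only one class is predicted
```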
Upvotes: 1
Views: 81
Reputation: 5095
The data looks a bit suspect to me... a histogram plot indicates that even though the dataset has about 100k rows, most are duplicates:

```python
import matplotlib.pyplot as plt

gdf.hist(grid=False, bins=50)
plt.tight_layout()
plt.show()
```
This is confirmed by counting up the unique rows. In the table below, even though about 41k rows (almost half the data set) fall under the y = 3 - 6 m label, they are duplicates of merely 4 unique values. To me it seems like something has gone amiss in a prior processing step for the data, and I think that'd be the place to start looking.
```python
gdf.groupby('Relative Height bin98 (cm)').value_counts().to_frame()
```
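(To quantify the duplication directly, pandas' `duplicated`, `drop_duplicates`, and per-group `nunique` are handy; sketched here on a tiny made-up frame whose column name mirrors the question's:)

```python
import pandas as pd

# made-up stand-in for gdf: many rows, few unique ones
df = pd.DataFrame({
    'Relative Height bin98 (cm)': ['3 - 6 m'] * 6 + ['<3 m'] * 2,
    'feature': [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 5.0, 5.0],
})

print(df.duplicated().sum())        # rows that are exact copies of an earlier row
print(df.drop_duplicates().shape)   # how many unique rows remain
# unique feature values per label, mirroring the groupby above
print(df.groupby('Relative Height bin98 (cm)')['feature'].nunique())
```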
Upvotes: 1