Reputation: 2964
In the following minimal reproducible dataset, I split a dataset into train and test sets, fit a logistic regression to the training set with scikit-learn, and predict y based on x_test.
However, the predictions y_pred are only correct if inverted (e.g. 0 = 1 and 1 = 0), calculated as 1 - y_pred.
Why is this the case? I can't figure out whether it is something related to the scaling of x (I have tried with and without the StandardScaler), something related to the logistic regression, or the accuracy score calculation.
In my larger dataset this is also the case, even when using different seeds as random state. I have also tried this Logistic Regression with the same result.
EDIT: as pointed out by @Nester, it works without the StandardScaler for this minimal dataset. The larger dataset is available here; StandardScaler does nothing on that larger dataset. I'll keep the OP's smaller dataset, as it might help in explaining the problem.
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# small dataset
Y = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
X = [[0.38373581], [0.56824121], [0.39078066], [0.41532221], [0.3996311],
     [0.3455455], [0.55867358], [0.51977073], [0.51937625], [0.48718916],
     [0.37019272], [0.49478954], [0.37277804], [0.6108499], [0.39718093],
     [0.33776591], [0.36384773], [0.50663667], [0.3247984]]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=42, stratify=Y)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred = 1 - y_pred # <- why?
accuracy_score(y_test, y_pred)
1.0
Larger dataset accuracy:
accuracy_score(y_test, y_pred)
0.7  # if inverted
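One way to check which class order scikit-learn uses (a sketch using clf and x_test from the snippet above; predict() simply picks the class with the highest predict_proba, so the library itself should not be swapping labels):
print(clf.classes_)               # expected [0 1]: the column order of predict_proba
print(clf.predict_proba(x_test))  # predict() returns the class with the larger probability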
Thanks for reading.
Upvotes: 2
Views: 1137
Reputation: 16966
X and Y do not have any relationship at all, so the model performs poorly; that is why 1 - y_pred appears to perform better. If you had more than two classes, the situation would be even worse.
%matplotlib inline
import matplotlib.pyplot as plt

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, stratify=Y)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(x_train, y_train)

# plot the scaled feature against the labels: the two classes overlap almost completely
plt.scatter(clf.named_steps['standardscaler'].transform(x_train), y_train)
plt.scatter(clf.named_steps['standardscaler'].transform(x_test), y_test)
print(clf.score(x_test, y_test))
The relationship is the same for your bigger dataset as well.
Try to identify other features that can help you predict Y.
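To see that the "inversion" is just noise from the tiny test split, you can repeat the split over many seeds (a sketch reusing X, Y and the imports from the question; exact numbers will vary):
import numpy as np

scores = []
for seed in range(100):
    x_tr, x_te, y_tr, y_te = train_test_split(
        X, Y, test_size=0.15, random_state=seed, stratify=Y)
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(x_tr, y_tr)
    scores.append(model.score(x_te, y_te))

# with no real signal, the mean accuracy hovers around chance with a large spread,
# so a perfect (or perfectly inverted) score on a 3-sample test set means little
print(np.mean(scores), np.std(scores))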
Upvotes: 1
Reputation: 159
Have you tried running the model without the StandardScaler()? Your data doesn't look like it needs rescaling.
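For example (a sketch reusing x_train, x_test, y_train, y_test and the imports from the question):
clf = LogisticRegression()  # same model, no scaling step
clf.fit(x_train, y_train)
print(accuracy_score(y_test, clf.predict(x_test)))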
Upvotes: 1