Reputation: 53806
I am following this example implementation of logistic regression with scikit-learn: https://analyticsdataexploration.com/logistic-regression-using-python/
After running predict, the following is produced:
predictions=modelLogistic.predict(test[predictor_Vars])
predictions
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0], dtype=int64)
I'm failing to understand the array values. I think they are related to the logistic function and represent what the model thinks the label is, but shouldn't these values be between 0 and 1 rather than exactly 0 or 1?
The documentation for the predict function reads:
predict(X)
    Predict class labels for samples in X.
    Parameters:
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Samples.
    Returns:
        C : array, shape = [n_samples]
            Predicted class label per sample.
Taking the first 5 values of the returned array (0, 1, 0, 0, 1), how are these interpreted as labels?
Complete code:
import numpy as np
import pandas as pd
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('/train.csv')
test = pd.read_csv('/test.csv')

def data_cleaning(train):
    # Fill missing values with the column median (or the most common port).
    train["Age"] = train["Age"].fillna(train["Age"].median())
    # Fixed: the original filled Fare from the Age column, overwriting the fares.
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")
    # Encode the categorical columns as integers.
    train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
    train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return train

train = data_cleaning(train)
test = data_cleaning(test)

predictor_Vars = ["Sex", "Age", "SibSp", "Parch", "Fare"]
X, y = train[predictor_Vars], train.Survived
X.iloc[:5]
y.iloc[:5]

modelLogistic = linear_model.LogisticRegression()
# cross_val_score fits clones of the estimator, so modelLogistic itself stays unfitted here.
modelLogisticCV = cross_val_score(modelLogistic, X, y, cv=15)
modelLogistic.fit(X, y)

# predict(X): predict class labels for samples in X.
predictions = modelLogistic.predict(test[predictor_Vars])
Update:
Printing the first 10 elements of the test dataset, I can see that they match the first 10 elements of the predictions array:
0, 1, 0, 0, 1, 0, 1, 0, 1, 0
So these are the model's predictions on the test dataset after fitting logistic regression on the train dataset.
Upvotes: 0
Views: 182
Reputation: 114
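As stated in the documentation, the values returned by the predict
function are class labels (the same kind of values you passed to the fit
function as y). In your case, 1 means survived and 0 means did not survive.
If you want a score for each prediction rather than a hard label, use predict_proba,
which returns the estimated probability of each class (values between 0 and 1).
There is also decision_function, which returns an unbounded signed confidence score;
for logistic regression this is the log-odds of the positive class.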
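To make the difference between the three methods concrete, here is a minimal sketch. It assumes a small synthetic one-feature dataset in place of the Titanic CSVs (the data, the variable names, and the test points are made up for illustration), so the printed numbers will not match your output. It shows that predict returns hard labels taken from classes_, predict_proba returns probabilities in [0, 1], and decision_function returns unbounded scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data (an assumption, not the Titanic set):
# one feature; the label tends to be 1 when the feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# A few hypothetical new samples to score.
X_new = np.array([[-2.0], [-0.1], [0.1], [2.0]])

labels = model.predict(X_new)            # hard class labels, taken from model.classes_
proba = model.predict_proba(X_new)       # shape (n_samples, 2); each row sums to 1
scores = model.decision_function(X_new)  # unbounded score; positive favours class 1

print("classes:", model.classes_)
print("labels: ", labels)
print("P(y=1): ", proba[:, 1])
print("scores: ", scores)

# For a binary problem, predict() is equivalent to thresholding P(y=1) at 0.5.
manual_labels = model.classes_[(proba[:, 1] > 0.5).astype(int)]
print("same as predict:", np.array_equal(labels, manual_labels))
```

In the binary case the probability of class 1 is the logistic (sigmoid) function applied to the decision score, which is why the labels come out as exactly 0 or 1 while the probabilities lie in between. If you need a cutoff other than 0.5, threshold proba[:, 1] yourself instead of calling predict.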
I hope this answers your question.
Upvotes: 2