Reputation: 29
I am working on a logistic regression model. I started out with two separate CSV files, one for training data and one for testing data. I created two separate data frames, one for each data set. I am able to fit and train the model just fine but am getting an error when I try to make predictions using the test data.
I am not sure if I am setting my y_train variable properly or if there is another issue going on. I get the following error messages when I run the prediction.
Here is the setup and code for the model:
#Setting x and y values
X_train = clean_df_train[['account_length','total_day_charge','total_eve_charge', 'total_night_charge',
'number_customer_service_calls']]
y_train = clean_df_train['churn']
X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge',
'number_customer_service_calls']]
y_test = clean_df_test['churn']
#Fitting / Training the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)
#Make Predictions with Logit Model
predictions = logreg.predict(X_test)
#Measure Performance of the model
from sklearn.metrics import classification_report
classification_report(y_test, predictions)
1522 """
1523
-> 1524 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
1525
1526 labels_given = True
E:\Users\davidwool\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
79 if len(y_type) > 1:
80 raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81 "and {1} targets".format(type_true, type_pred))
82
83 # We can't have more than one value on y_type => The set is no more needed
ValueError: Classification metrics can't handle a mix of continuous and binary targets
Here is the head of the data that I am working with. The churn column is completely blank as it is what I am trying to predict.
clean_df_test.head()
account_length total_day_charge total_eve_charge total_night_charge number_customer_service_calls churn
0 74 31.91 13.89 8.82 0 NaN
1 57 30.06 16.58 9.61 0 NaN
2 111 36.43 17.72 8.21 1 NaN
3 77 42.81 17.48 12.38 2 NaN
4 36 47.84 17.19 8.42 2 NaN
Here are the dtypes as well.
clean_df_test.dtypes
account_length int64
total_day_charge float64
total_eve_charge float64
total_night_charge float64
number_customer_service_calls int64
churn float64
dtype: object
The main problem is that I am used to using sklearn's train_test_split() function on one dataset, whereas here I have two separate datasets, so I am not sure what to set my y_test to be.
Upvotes: 0
Views: 1500
Reputation: 6323
The problem becomes evident by looking at clean_df_test.head(). I can see there are null values in the column churn.
As a consequence, y_test contains null values. When it is passed as y_true to classification_report(), scikit-learn sees a float column of NaN and treats it as a continuous target, while the predictions are binary. That mismatch is what raises the ValueError in your traceback.
To solve this, try dropping the rows where churn is NaN and run the rest of your code as before.
# Drop records where `churn` is NaN
clean_df_test.dropna(axis=0, subset=['churn'], inplace=True)
# Carry on as before
X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge',
'number_customer_service_calls']]
y_test = clean_df_test['churn']
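Here is a minimal sketch of that fix on made-up data, with a guard for the case the question describes, where every churn value in the test file is blank. The toy frames, the reduced feature list, and the helper name labeled_rows are assumptions for illustration, not part of the original code. If no labeled rows survive the drop, there is nothing to score against. One common alternative is then to hold out part of the training data with train_test_split() and evaluate on that instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Reduced, hypothetical feature list to keep the example short
FEATURES = ["account_length", "number_customer_service_calls"]


def labeled_rows(df, target="churn"):
    """Return only the rows whose target is known, with an integer target."""
    out = df.dropna(subset=[target]).copy()
    out[target] = out[target].astype(int)
    return out


# Toy training data (made-up values): 8 non-churners, 4 churners
train = pd.DataFrame({
    "account_length": [74, 57, 111, 77, 36, 90, 120, 45, 82, 64, 101, 53],
    "number_customer_service_calls": [0, 0, 1, 2, 2, 5, 4, 1, 6, 0, 5, 1],
    "churn": [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
})

# Toy test data whose churn column is entirely blank, as in the question
test = pd.DataFrame({
    "account_length": [60, 95],
    "number_customer_service_calls": [1, 6],
    "churn": [np.nan, np.nan],
})

scorable = labeled_rows(test)
if scorable.empty:
    # No labels in the test file: evaluate on a hold-out slice of the
    # training data instead.
    X_fit, X_val, y_fit, y_val = train_test_split(
        train[FEATURES], train["churn"],
        test_size=0.25, random_state=0, stratify=train["churn"],
    )
else:
    # Some test rows are labeled: fit on all training data, score on them.
    X_fit, y_fit = train[FEATURES], train["churn"]
    X_val, y_val = scorable[FEATURES], scorable["churn"]

model = LogisticRegression().fit(X_fit, y_fit)
report = classification_report(y_val, model.predict(X_val))
print(report)

# The unlabeled test rows can still be predicted, just not scored
test_predictions = model.predict(test[FEATURES])
```

The design choice here is to keep scoring and predicting separate: the report is computed only on rows with known labels, while predictions for the unlabeled file come from the same fitted model.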
Another way of spotting this issue is to look at the data types of clean_df_test. From the output, churn's type is float64, which should not be the case if it were filled exclusively with ones and zeros!
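The dtype effect can be reproduced on a small hypothetical Series (not the original data). NumPy-backed pandas columns store NaN as a float, so an integer column with missing entries is upcast to float64. Counting the missing values with isna().sum() is the more direct check, since a column of 0.0 and 1.0 would also be float without containing any nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical label columns: one complete, one with a missing entry
complete = pd.Series([0, 1, 0, 1], name="churn")
with_gaps = pd.Series([0, 1, np.nan, 1], name="churn")

# A fully populated 0/1 column keeps an integer dtype
print(pd.api.types.is_integer_dtype(complete.dtype))

# A single NaN forces the whole column to float64
print(with_gaps.dtype)

# Counting nulls directly is the more reliable diagnostic
missing = int(with_gaps.isna().sum())
print(missing)
```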
Upvotes: 2