Reputation:
I'm trying to make a simple logistic regression program for a dataset that looks like this: https://i.sstatic.net/NV5TM.jpg
My program should run logistic regression on the dataset and output some information about the results. Working from an example, I wrote this code:
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
col_names = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean','compactness_mean', 'symmetry_se', 'perimeter_worst', 'smoothness_worst', 'concavity_worst']
# load dataset
data = pd.read_csv("DatasetTest.csv", header=None, names=col_names)
data.head()
feature_cols = ['diagnosis', 'radius_mean','texture_mean','perimeter_mean','area_mean', 'smoothness_mean','compactness_mean', 'symmetry_se', 'perimeter_worst','smoothness_worst', 'concavity_worst']
X = data[feature_cols]
y = data.diagnosis
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
When running the code this error occurs:
could not convert string to float: 'concavity_worst'
I couldn't find a similar problem on Stack Overflow. After some research I found that fit() apparently can't take strings, but I'm not sure how to convert the strings to floats. Further googling didn't turn up a solution that works for this situation.
Example used for the code: https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f
Upvotes: 1
Views: 1325
Reputation: 36624
According to the pandas documentation of pd.read_csv, you need to "explicitly pass header=0 to be able to replace existing names".
If you don't, pandas treats the file's header row as the first row of data. Your column names then end up mixed in with your values, so every column contains at least one string. That is why fit() fails: it cannot convert a string such as 'concavity_worst' to a float.
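To illustrate the difference, here is a minimal sketch that reads the same small CSV twice, once with header=None and once with header=0. The three-column csv_text is a made-up stand-in for DatasetTest.csv (I don't have your file), and the names bad and good are only for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for DatasetTest.csv (assumed layout:
# a header row followed by numeric data rows).
csv_text = "id,diagnosis,radius_mean\n1,1,17.99\n2,0,13.54\n"
col_names = ["id", "diagnosis", "radius_mean"]

# header=None: the file's header row is read as the first row of data,
# so each column starts with a string such as "radius_mean".
bad = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the file's header row is consumed and replaced by col_names,
# so only the real data rows remain and the columns are parsed as numbers.
good = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(bad.head())
print(good.dtypes)
```

In your code, the equivalent change would be pd.read_csv("DatasetTest.csv", header=0, names=col_names). Separately, feature_cols includes 'diagnosis', which is also your target y; if that column holds text labels it would raise the same kind of error, so it is probably worth removing it from the features either way.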
Upvotes: 0