user11459133

Logistic regression in Python error: ValueError: could not convert string to float: 'concavity_worst'

I'm trying to make a simple logistic regression program for a dataset that looks like this: https://i.sstatic.net/NV5TM.jpg

My program should run logistic regression on the dataset and output some information about the results. Working from an example, I wrote this code:

import matplotlib.pyplot as plt
from scipy import stats

import pandas as pd
col_names = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean','compactness_mean', 'symmetry_se', 'perimeter_worst', 'smoothness_worst', 'concavity_worst']

# load dataset
data = pd.read_csv("DatasetTest.csv", header=None, names=col_names)
data.head()

feature_cols = ['diagnosis', 'radius_mean','texture_mean','perimeter_mean','area_mean', 'smoothness_mean','compactness_mean', 'symmetry_se', 'perimeter_worst','smoothness_worst', 'concavity_worst']
X = data[feature_cols]
y = data.diagnosis

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)

from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

When running the code this error occurs:

could not convert string to float: 'concavity_worst'

I couldn't find a similar problem on Stack Overflow. After some research I found that fit() apparently can't take strings, but I'm not sure how to convert the strings to floats, and further searching didn't turn up a solution that works for this situation.

Example used for the code: https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f

Upvotes: 1

Views: 1325

Answers (1)

Nicolas Gervais

Reputation: 36624

According to the pandas documentation for pd.read_csv, when the file already has a header row and you pass your own names, you need to

Explicitly pass header=0 to be able to replace existing names

If you don't (your code passes header=None), pandas treats the file's header row as the first row of data. The column names then end up mixed in with the values, so every column is read as strings instead of numbers. fit() then fails because it cannot convert a string such as 'concavity_worst' to a float.
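To see the difference, here is a minimal sketch. It assumes the CSV starts with a header row, and it uses a small made-up in-memory file (csv_text, with only four of the columns and invented values) in place of DatasetTest.csv, so the numbers are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for DatasetTest.csv: one header row, then data rows.
# The values are invented for illustration only.
csv_text = (
    "id,diagnosis,radius_mean,concavity_worst\n"
    "1,M,17.99,0.7119\n"
    "2,B,13.54,0.2390\n"
    "3,M,19.69,0.4504\n"
)
col_names = ["id", "diagnosis", "radius_mean", "concavity_worst"]

# header=None: the file's header row is read as the first data row,
# so the numeric columns end up holding strings.
wrong = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the file's header row is consumed and replaced by col_names,
# so the numeric columns are parsed as numbers.
right = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(wrong["concavity_worst"].iloc[0])   # the header text, not a number
print(pd.api.types.is_numeric_dtype(wrong["concavity_worst"]))
print(pd.api.types.is_numeric_dtype(right["concavity_worst"]))
```

In your script, the equivalent change would be pd.read_csv("DatasetTest.csv", header=0, names=col_names). Separately, feature_cols also lists 'diagnosis', which is the target; if that column holds text labels, it would likely cause a similar conversion error in fit() and should probably be left out of X.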

Upvotes: 0
