Not able to convert string to float in Python, and how to train the model with this dataset

Question

I have a dataset with columns: age (float), gender (str), region (str) and charges (float).

I want to predict charges using age, gender and region as features. How can I do that in scikit-learn?

I tried the following, but it fails with "ValueError: could not convert string to float: 'northwest'":

import pandas as pd
import numpy as np
df = pd.read_csv('Desktop/insurance.csv')
X = df.loc[:,['age','sex','region']].values
y = df.loc[:,['charges']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn import svm
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)

ysearka · Accepted Answer

The region column contains strings, which the SVM classifier cannot use as they are: it expects numeric feature vectors. The same applies to the sex column.

Therefore you have to turn these columns into something the SVM can use. Here is an example that converts region into a categorical series and keeps its integer codes:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
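# Encode sex as a boolean: True for 'female', False for 'male'
df['sex'] = (df['sex'] == 'female')
# Convert region to a categorical series, then keep only its integer codes
df['region'] = pd.Categorical(df['region'])
df['region'] = df['region'].cat.codes
X = df[['age', 'sex', 'region']]
# A 1-D Series as target avoids scikit-learn's column-vector warning
y = df['charges']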

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
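
To make it clearer what the categorical conversion above stores, here is a minimal sketch that prints the code assigned to each region. It assumes the same four region names as the toy frame; the variable names are only illustrative. pandas sorts string categories alphabetically by default, so the codes follow that order and not the order of appearance.

```python
import pandas as pd

regions = pd.Series(['northwest', 'southwest', 'northeast', 'southeast'])
region_cat = pd.Categorical(regions)

# Categories are sorted alphabetically; each code is an index into that list.
code_by_region = {name: code for code, name in enumerate(region_cat.categories)}
codes = [int(c) for c in region_cat.codes]

print(code_by_region)
print(codes)
```

Because the codes are plain integers, the model will treat them as ordered numbers (for instance southwest > northeast). That ordering comes from the encoding, not from the data.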

Another way to approach this problem is to use one-hot vector encoding:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
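# Encode sex as a boolean: True for 'female', False for 'male'
df['sex'] = (df['sex'] == 'female')
# Append one indicator column per region, then drop the original string column.
# (Passing the axis positionally, as in drop('region', 1), fails on pandas 2.0+.)
df = pd.concat([df, pd.get_dummies(df['region'])], axis=1).drop(columns='region')
X = df.drop(columns='charges')
y = df['charges']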
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
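
As a small illustration of what the one-hot step produces, the sketch below applies pd.get_dummies to the toy region column on its own and shows the resulting columns. The names dummies and encoded are only for this example.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'region': ['northwest', 'southwest', 'northeast', 'southeast']})

# One indicator column per distinct region value.
dummies = pd.get_dummies(df['region'])

# Replace the string column with its indicator columns.
encoded = pd.concat([df.drop(columns='region'), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set, so no artificial ordering between regions is introduced. Depending on the pandas version, the indicator columns are booleans or 0/1 integers.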

Yet another approach is to perform label encoding:

from sklearn.preprocessing import LabelEncoder

# df is assumed to still contain the original string 'region' column
le = LabelEncoder()
df['region'] = le.fit_transform(df['region'])
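
A self-contained sketch of the same idea, assuming the four region names from the toy data, shows which integer each label receives and how to map the integers back to strings:

```python
from sklearn.preprocessing import LabelEncoder

regions = ['northwest', 'southwest', 'northeast', 'southeast']

le = LabelEncoder()
encoded = le.fit_transform(regions)

# classes_ holds the original labels in sorted order; the codes index into it.
classes = list(le.classes_)
print(classes)
print([int(c) for c in encoded])

# inverse_transform maps the integer codes back to the original strings.
decoded = list(le.inverse_transform(encoded))
print(decoded)
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels; for feature columns, OrdinalEncoder provides the equivalent integer encoding.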

This list of methods is of course not exhaustive, and which one works best depends on your problem.
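
As one further option, here is a hedged sketch that puts the encoding inside a scikit-learn pipeline, so the string columns are one-hot encoded as part of fitting. Two things here are assumptions and not part of the examples above: it uses OneHotEncoder with ColumnTransformer in place of pd.get_dummies, and it swaps SVC for SVR, since charges is a continuous value and SVC is a classifier. The data is the same toy frame, not the real insurance.csv, and the model is fitted on all four rows purely to show the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVR

# Toy data only; replace with the real dataset.
df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'sex': ['male', 'female', 'female', 'male'],
                   'region': ['northwest', 'southwest', 'northeast', 'southeast'],
                   'charges': [1000, 1000, 2000, 2000]})

X = df[['age', 'sex', 'region']]
y = df['charges']

# One-hot encode the string columns; pass 'age' through unchanged.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['sex', 'region'])],
    remainder='passthrough',
)

# SVR (regression) is an assumption here, chosen because the target is continuous.
model = Pipeline(steps=[('preprocess', preprocess),
                        ('svr', SVR(kernel='rbf', C=1.0, gamma='auto'))])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the encoder inside the pipeline means the same transformation is applied at prediction time, and handle_unknown='ignore' lets the model accept a region value it did not see during training.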

Handling non-numeric data is non-trivial and requires some knowledge of the existing techniques (I encourage you to search Kaggle's forums, where you can find valuable information).
