hatim

Reputation: 23

Not able to convert string to float in Python, and how to train a model with this dataset

I have a dataset with the columns age (float type), gender (column sex, str type), region (str type) and charges (float type).

I want to predict charges using age, gender and region as features. How can I do that in scikit-learn?

I tried the following, but it raises "ValueError: could not convert string to float: 'northwest'":

import pandas as pd
import numpy as np
df = pd.read_csv('Desktop/insurance.csv')
X = df.loc[:,['age','sex','region']].values
y = df.loc[:,['charges']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn import svm
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)

Upvotes: 0

Views: 3210

Answers (1)

ysearka

Reputation: 3855

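The region column contains strings, which the SVM classifier cannot use as they are: it expects a numeric feature matrix, and a string such as 'northwest' cannot be converted to a float. The same applies to the sex column.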
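As a quick check, the string columns can be listed before fitting. This is a minimal sketch on a made-up four-row frame that mirrors the columns in the question; the variable name string_cols is only illustrative.

```python
import pandas as pd

# Toy data with the same columns as the question (values are invented)
df = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'sex': ['male', 'female', 'female', 'male'],
    'region': ['northwest', 'southwest', 'northeast', 'southeast'],
    'charges': [1000.0, 1000.0, 2000.0, 2000.0],
})

# Every column that is not numeric has to be encoded before fitting
string_cols = df.select_dtypes(exclude='number').columns.tolist()
print(string_cols)
```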

Therefore you have to turn this column into something the SVM can use. Here is an example that converts region into a categorical series and keeps its integer codes:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
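# Encode sex as a boolean (True for 'female')
df['sex'] = (df['sex'] == 'female')
# Replace each region name with its integer category code
df['region'] = pd.Categorical(df['region']).codes
X = df[['age', 'sex', 'region']]
# Use a 1-D target; a one-column DataFrame triggers a DataConversionWarning in fit
y = df['charges']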

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
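To see what the category codes stand for, the mapping can be inspected directly. This is a small sketch on the same four region names; the names cat, codes and mapping are illustrative. Note that the codes follow the sorted order of the category names, so the model will see an ordering between regions that has no real meaning.

```python
import pandas as pd

region = pd.Series(['northwest', 'southwest', 'northeast', 'southeast'])

# Categories are sorted alphabetically; codes index into that sorted list
cat = pd.Categorical(region)
codes = cat.codes

# Map each integer code back to its region name
mapping = dict(enumerate(cat.categories))
print(list(codes))
print(mapping)
```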

Another way to approach this problem is to use one-hot vector encoding:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
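# Encode sex as a boolean (True for 'female')
df['sex'] = (df['sex'] == 'female')
# Add one indicator column per region, then drop the original string column.
# Note: drop('region', 1) with a positional axis no longer works in pandas 2.x.
df = pd.concat([df, pd.get_dummies(df['region'])], axis=1).drop(columns='region')
X = df.drop(columns='charges')
y = df['charges']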
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
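The same one-hot idea can also be done inside scikit-learn, so that the encoding is learned in fit and reapplied in predict. The following is only a sketch under a few assumptions: it swaps SVC for SVR, because SVC is a classifier and the charges in the question are floats; it scales age with StandardScaler; and it fits on all four toy rows without a split, just to show the wiring.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Toy data (invented values), same columns as the question
df = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'sex': ['male', 'female', 'female', 'male'],
    'region': ['northwest', 'southwest', 'northeast', 'southeast'],
    'charges': [1000.0, 1000.0, 2000.0, 2000.0],
})
X = df[['age', 'sex', 'region']]
y = df['charges']

# One-hot encode the string columns, scale the numeric one
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'region']),
    ('num', StandardScaler(), ['age']),
])

# SVR is the regression counterpart of SVC (assumption: charges is continuous)
model = make_pipeline(preprocess, SVR(kernel='rbf', C=1.0))
model.fit(X, y)
pred = model.predict(X)
```

With handle_unknown='ignore', a region that was not seen during fit is encoded as all zeros instead of raising an error at prediction time.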

Yet another approach is to perform label encoding:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.region = le.fit_transform(df.region)
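A short sketch of what the encoder stores and how to get the original names back, using the same four region names (the variable names are illustrative). Keep in mind that the scikit-learn documentation describes LabelEncoder as meant for target labels; for feature columns, OrdinalEncoder plays the same role.

```python
from sklearn.preprocessing import LabelEncoder

regions = ['northwest', 'southwest', 'northeast', 'southeast']

le = LabelEncoder()
encoded = le.fit_transform(regions)

# classes_ holds the sorted labels; the position of a label is its code
classes = list(le.classes_)

# inverse_transform recovers the original strings from the codes
decoded = list(le.inverse_transform(encoded))
print(list(encoded), classes, decoded)
```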

This list of methods is of course not exhaustive, and how well each one performs depends on your problem.

Handling non-numeric data is non-trivial and requires some knowledge of the existing techniques (I encourage you to search Kaggle's forums, where you can find valuable information).

Upvotes: 2
