Stark_Tony

Reputation: 87

ValueError "could not convert string to float" when scaling y_train with StandardScaler

I am learning linear regression from this code on GitHub: "https://github.com/Anubhav1107/Machine_Learning_A-Z/blob/master/Part%202%20-%20Regression/Section%205%20-%20Multiple%20Linear%20Regression/multiple_linear_regression.py"

but when I tried running it myself, I got this error:

ValueError                                Traceback (most recent call last)
<ipython-input-26-860be404cdc9> in <module>()
      1 sc_y = StandardScaler()
----> 2 y_train = sc_y.fit_transform(y_train)

4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'Florida'

I am running it on Google Colab. I have already encoded the categorical features, so I don't understand what the problem is.

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()


# Splitting the dataset into the Training set and Test set

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

Upvotes: 1

Views: 3479

Answers (1)

desertnaut

Reputation: 60321

There is a reason why, in How to create a Minimal, Reproducible Example, we ask that you:

Make sure all information necessary to reproduce the problem is included in the question itself

and not in some external file, parts of which you may or may not have executed correctly.

I am saying this because I cannot reproduce your error; executing the relevant parts of the linked code works OK here:

import numpy as np
import pandas as pd
import sklearn
sklearn.__version__
# '0.21.3'

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # model_selection here, due to the newer version of scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# FutureWarning here, irrelevant to the issue

At this stage, we have:

y_train
# result:
array([ 96778.92,  96479.51, 105733.54,  96712.8 , 124266.9 , 155752.6 ,
       132602.65,  64926.08,  35673.41, 101004.64, 129917.04,  99937.59,
        97427.84, 126992.93,  71498.49, 118474.03,  69758.98, 152211.77,
       134307.35, 107404.34, 156991.12, 125370.37,  78239.91,  14681.4 ,
       191792.06, 141585.52,  89949.14, 108552.04, 156122.51, 108733.99,
        90708.19, 111313.02, 122776.86, 149759.96,  81005.76,  49490.75,
       182901.99, 192261.83,  42559.73,  65200.33])

which I bet is not the case with your (not shown) full code.

Slightly modifying the last line below to use y_train.reshape(-1,1) (again, irrelevant to the issue; without it we get a different error asking us to reshape), we have:

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1,1))  # reshape here

which works OK, giving

y_train
# result
array([[-0.31304376],
       [-0.32044287],
       [-0.09175449],
       [-0.31467774],
       [ 0.3662475 ],
       [ 1.14433163],
       [ 0.57224308],
       [-1.10020076],
       [-1.82310158],
       [-0.20861649],
       [ 0.50587547],
       [-0.23498575],
       [-0.29700745],
       [ 0.43361398],
       [-0.93778138],
       [ 0.22309235],
       [-0.98076868],
       [ 1.05682957],
       [ 0.61437014],
       [-0.05046517],
       [ 1.17493831],
       [ 0.39351679],
       [-0.77118537],
       [-2.34186247],
       [ 2.03494965],
       [ 0.79423047],
       [-0.48182335],
       [-0.02210286],
       [ 1.15347296],
       [-0.01760646],
       [-0.46306547],
       [ 0.04612731],
       [ 0.32942519],
       [ 0.9962397 ],
       [-0.70283485],
       [-1.4816433 ],
       [ 1.81525556],
       [ 2.04655875],
       [-1.65292476],
       [-1.09342341]])
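
As a side note on the reshape: a minimal sketch of why it is needed, using a made-up four-value array in place of y_train (the values and variable names here are my own, not from the dataset). StandardScaler expects a 2-D array of shape (n_samples, n_features), so a 1-D target has to be turned into a single column first; inverse_transform can then take the scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values, standing in for y_train
y = np.array([10.0, 20.0, 30.0, 40.0])

scaler = StandardScaler()

# A 1-D array is rejected: the scaler expects shape (n_samples, n_features)
try:
    scaler.fit_transform(y)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# Reshaping to a single column gives the expected 2-D layout
y_scaled = scaler.fit_transform(y.reshape(-1, 1))

# inverse_transform maps the scaled values back to the original units
y_restored = scaler.inverse_transform(y_scaled).ravel()

print(rejected_1d)
print(y_scaled.shape)
```

This is the "different error" mentioned above; it is unrelated to the string-to-float problem.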

It certainly seems that, instead of y = dataset.iloc[:, 4].values, you have asked for y = dataset.iloc[:, 3].values, which gives:

dataset.iloc[:, 3].values
# result:
array(['New York', 'California', 'Florida', 'New York', 'Florida',
       'New York', 'California', 'Florida', 'New York', 'California',
       'Florida', 'California', 'Florida', 'California', 'Florida',
       'New York', 'California', 'New York', 'Florida', 'New York',
       'California', 'New York', 'Florida', 'Florida', 'New York',
       'California', 'Florida', 'New York', 'Florida', 'New York',
       'Florida', 'New York', 'California', 'Florida', 'California',
       'New York', 'Florida', 'California', 'New York', 'California',
       'California', 'Florida', 'California', 'New York', 'California',
       'New York', 'Florida', 'California', 'New York', 'California'],
      dtype=object)

With this change, the above code indeed gives:

y_train
# result:
array(['Florida', 'New York', 'Florida', 'California', 'Florida',
       'Florida', 'Florida', 'New York', 'New York', 'New York',
       'New York', 'Florida', 'California', 'California', 'California',
       'California', 'New York', 'New York', 'California', 'California',
       'New York', 'New York', 'California', 'California', 'California',
       'Florida', 'California', 'New York', 'California', 'Florida',
       'Florida', 'New York', 'New York', 'California', 'California',
       'Florida', 'New York', 'New York', 'California', 'California'],
      dtype=object)

and eventually:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-4a9512e0c95c> in <module>
      5 X_test = sc_X.transform(X_test)
      6 sc_y = StandardScaler()
----> 7 y_train = sc_y.fit_transform(y_train.reshape(-1,1))

[...]
ValueError: could not convert string to float: 'Florida'
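
To make this kind of slip easier to catch, here is a rough sketch of a guard you could put before the scaling step. It uses a tiny made-up DataFrame (the column names, the values and the scale_target helper are all hypothetical, not taken from 50_Startups.csv), selects the target by name instead of by position, and checks its dtype before handing it to StandardScaler.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame; column names and values are illustrative only
df = pd.DataFrame({
    "spend_a": [1.0, 2.0, 3.0, 4.0],
    "spend_b": [5.0, 6.0, 7.0, 8.0],
    "spend_c": [9.0, 8.0, 7.0, 6.0],
    "State": ["Florida", "New York", "California", "Florida"],
    "Profit": [100.0, 200.0, 300.0, 400.0],
})

wrong_y = df.iloc[:, 3]   # position 3 is the string column
right_y = df["Profit"]    # selecting by name avoids off-by-one slips


def scale_target(y):
    """Scale a target Series, refusing non-numeric input with a clear message."""
    if not is_numeric_dtype(y):
        raise TypeError(f"target has dtype {y.dtype}; expected a numeric column")
    return StandardScaler().fit_transform(y.to_numpy().reshape(-1, 1))


# Without the guard, the string column fails inside the scaler with a ValueError
try:
    StandardScaler().fit_transform(wrong_y.to_numpy(dtype=object).reshape(-1, 1))
    raw_error = None
except ValueError as exc:
    raw_error = exc

scaled = scale_target(right_y)  # numeric target: scales without complaint
```

Calling scale_target(wrong_y) raises the TypeError immediately, which points at the column selection rather than at a line deep inside NumPy.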

Upvotes: 1
