Reputation: 87
I am learning linear regression from this script on GitHub: "https://github.com/Anubhav1107/Machine_Learning_A-Z/blob/master/Part%202%20-%20Regression/Section%205%20-%20Multiple%20Linear%20Regression/multiple_linear_regression.py"
but when I tried running it, I got this error:
ValueError Traceback (most recent call last)
<ipython-input-26-860be404cdc9> in <module>()
1 sc_y = StandardScaler()
----> 2 y_train = sc_y.fit_transform(y_train)
4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not convert string to float: 'Florida'
I am running it on Google Colab. I have already encoded the categorical features, so I don't understand what the problem is.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
Upvotes: 1
Views: 3479
Reputation: 60321
There is a reason why, in How to create a Minimal, Reproducible Example, we ask that you:
Make sure all information necessary to reproduce the problem is included in the question itself
and not in some external file, parts of which you may or may not have executed correctly.
I am saying this because I cannot reproduce your error; executing the relevant parts of the linked code works OK here:
import numpy as np
import pandas as pd
import sklearn
sklearn.__version__
# '0.21.3'
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split # model_selection here, due to a newer version of scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# FutureWarning here, irrelevant to the issue
At this stage, we have:
y_train
# result:
array([ 96778.92, 96479.51, 105733.54, 96712.8 , 124266.9 , 155752.6 ,
132602.65, 64926.08, 35673.41, 101004.64, 129917.04, 99937.59,
97427.84, 126992.93, 71498.49, 118474.03, 69758.98, 152211.77,
134307.35, 107404.34, 156991.12, 125370.37, 78239.91, 14681.4 ,
191792.06, 141585.52, 89949.14, 108552.04, 156122.51, 108733.99,
90708.19, 111313.02, 122776.86, 149759.96, 81005.76, 49490.75,
182901.99, 192261.83, 42559.73, 65200.33])
which I bet is not the case with your (not shown) full code.
Slightly modifying the last line below to use y_train.reshape(-1,1)
(again, irrelevant to the issue; without it we get a different error asking us to reshape), we have:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1,1)) # reshape here
which works OK, giving
y_train
# result
array([[-0.31304376],
[-0.32044287],
[-0.09175449],
[-0.31467774],
[ 0.3662475 ],
[ 1.14433163],
[ 0.57224308],
[-1.10020076],
[-1.82310158],
[-0.20861649],
[ 0.50587547],
[-0.23498575],
[-0.29700745],
[ 0.43361398],
[-0.93778138],
[ 0.22309235],
[-0.98076868],
[ 1.05682957],
[ 0.61437014],
[-0.05046517],
[ 1.17493831],
[ 0.39351679],
[-0.77118537],
[-2.34186247],
[ 2.03494965],
[ 0.79423047],
[-0.48182335],
[-0.02210286],
[ 1.15347296],
[-0.01760646],
[-0.46306547],
[ 0.04612731],
[ 0.32942519],
[ 0.9962397 ],
[-0.70283485],
[-1.4816433 ],
[ 1.81525556],
[ 2.04655875],
[-1.65292476],
[-1.09342341]])
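As a side note on the reshape used above, here is a minimal NumPy-only sketch of what it does to the array's shape, followed by a hand-written column-wise standardization. The values in y are made up for illustration and are not taken from 50_Startups.csv; the manual formula (mean subtraction, division by the population standard deviation) is my understanding of what StandardScaler computes by default, not a copy of its implementation.

```python
import numpy as np

# Made-up target values, for illustration only (not from 50_Startups.csv).
y = np.array([10.0, 20.0, 30.0, 40.0])

print(y.shape)            # 1-D array: a single axis of length 4
y_col = y.reshape(-1, 1)  # one column; -1 lets NumPy infer the row count
print(y_col.shape)        # 2-D array: 4 rows, 1 column

# Column-wise standardization written by hand: subtract the column mean and
# divide by the population standard deviation (ddof=0, NumPy's default).
y_scaled = (y_col - y_col.mean(axis=0)) / y_col.std(axis=0)
print(y_scaled.ravel())
```

The 2-D shape is what the scaler expects: rows are samples and columns are features, so a single target has to be presented as one column.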
It certainly seems that, instead of y = dataset.iloc[:, 4].values, you have asked for y = dataset.iloc[:, 3].values, which gives:
dataset.iloc[:, 3].values
# result:
array(['New York', 'California', 'Florida', 'New York', 'Florida',
'New York', 'California', 'Florida', 'New York', 'California',
'Florida', 'California', 'Florida', 'California', 'Florida',
'New York', 'California', 'New York', 'Florida', 'New York',
'California', 'New York', 'Florida', 'Florida', 'New York',
'California', 'Florida', 'New York', 'Florida', 'New York',
'Florida', 'New York', 'California', 'Florida', 'California',
'New York', 'Florida', 'California', 'New York', 'California',
'California', 'Florida', 'California', 'New York', 'California',
'New York', 'Florida', 'California', 'New York', 'California'],
dtype=object)
With this change, the above code indeed gives:
y_train
# result:
array(['Florida', 'New York', 'Florida', 'California', 'Florida',
'Florida', 'Florida', 'New York', 'New York', 'New York',
'New York', 'Florida', 'California', 'California', 'California',
'California', 'New York', 'New York', 'California', 'California',
'New York', 'New York', 'California', 'California', 'California',
'Florida', 'California', 'New York', 'California', 'Florida',
'Florida', 'New York', 'New York', 'California', 'California',
'Florida', 'New York', 'New York', 'California', 'California'],
dtype=object)
and eventually:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-4a9512e0c95c> in <module>
5 X_test = sc_X.transform(X_test)
6 sc_y = StandardScaler()
----> 7 y_train = sc_y.fit_transform(y_train.reshape(-1,1))
[...]
ValueError: could not convert string to float: 'Florida'
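One way to catch this kind of column mix-up early is to check the dtype of the target column before handing it to the scaler. The sketch below is only an illustration: the tiny DataFrame is made up, the column names (spend_a, spend_b, spend_c, State, Profit) are assumptions rather than the real headers of 50_Startups.csv, and numeric_target is a hypothetical helper, not part of pandas or scikit-learn.

```python
import pandas as pd

# Hypothetical miniature dataset: made-up rows and assumed column names,
# laid out like the question's data (three numeric columns, State, target).
dataset = pd.DataFrame({
    "spend_a": [100.0, 200.0, 300.0],
    "spend_b": [10.0, 20.0, 30.0],
    "spend_c": [1.0, 2.0, 3.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [50.0, 60.0, 70.0],
})

def numeric_target(df, col_index):
    """Hypothetical helper: return a column as floats, or fail with a clear message."""
    column = df.iloc[:, col_index]
    if not pd.api.types.is_numeric_dtype(column):
        raise TypeError(
            f"column {col_index} ({column.name!r}) is not numeric; "
            "did you pick the wrong column index?"
        )
    return column.to_numpy(dtype=float)

y = numeric_target(dataset, 4)   # the numeric target column: works
print(y)

try:
    numeric_target(dataset, 3)   # the State column: strings, rejected up front
except TypeError as err:
    print(err)
```

Selecting the target by name, or with dataset.iloc[:, -1] when it is the last column, also makes an off-by-one index like this less likely.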
Upvotes: 1