Reputation: 6159
I just started learning machine learning, when practicing one of the task, I am getting value error, but I followed the same steps as the instructor does.
I am getting value error, please help.
dff
Country Name
0 AUS Sri
1 USA Vignesh
2 IND Pechi
3 USA Raj
First I performed labelencoding,
X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])
out:
X
array([[0, 'Sri'],
[2, 'Vignesh'],
[1, 'Pechi'],
[2, 'Raj']], dtype=object)
then performed One hot encoding for the same X
onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
I am getting the below error:
ValueError Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
1695 X : array or sparse matrix, shape=(n_samples, n_features_new)
1696 """
-> 1697 X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
1698
1699 if isinstance(selected, six.string_types) and selected == "all":
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: 'Raj'
Please edit my question is anything wrong, thanks in advance!
Upvotes: 2
Views: 14392
Reputation: 1
to make the long story short, if you are looking to dummify your df, use dummy=pd.get_dummies
as:
dummy=pd.get_dummies(df['str'])
df=pd.concat([df,dummy], axis=1)
print(Data)
Upvotes: 0
Reputation: 3969
You can go directly to OneHotEncoding now without using the LabelEncoder, and as we move toward version 0.22 many might want to do things this way to avoid warnings and potential errors (see DOCS and EXAMPLES).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
names = np.unique(X[:,1])
ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()
print (X)
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
The first 3 columns encode the country names, the last four the personal names.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()
print (X)
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
Now, here's the unique part. What if you only need to One Hot Encode a specific column for your data?
(Note: I've left the last column as strings for easier illustration. In reality it makes more sense to do this WHEN the last column was already numerical).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
names = np.unique(X[:,1])
ohe = OneHotEncoder(categories=[countries]) # specify ONLY unique country names
tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
X = np.append(tmp, names.reshape(-1,1), axis=1)
print (X)
[[1.0 0.0 0.0 'Pechi']
[0.0 0.0 1.0 'Raj']
[0.0 1.0 0.0 'Sri']
[0.0 0.0 1.0 'Vignesh']]
Upvotes: 7
Reputation: 3632
An alternative if you do want to encode multiple categorical features is to use a Pipeline with a FeatureUnion and a couple custom Transformers.
First need two transformers - one for selecting a single column and one for making LabelEncoder usable in a Pipeline (The fit_transform method only takes X, it needs to take an optional y to work in a Pipeline).
from sklearn.base import BaseEstimator, TransformerMixin
class SingleColumnSelector(TransformerMixin, BaseEstimator):
def __init__(self, column):
self.column = column
def transform(self, X, y=None):
return X[:, self.column].reshape(-1, 1)
def fit(self, X, y=None):
return self
class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
return LabelEncoder().fit_transform(X).reshape(-1, 1)
Next create a Pipeline (or just a FeatureUnion) which has 2 branches - one for each of the categorical columns. Within each select 1 column, encode the labels and then one hot encode.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
pipeline = Pipeline([(
'encoded_features',
FeatureUnion([('countries',
make_pipeline(
SingleColumnSelector(0),
PipelineAwareLabelEncoder(),
OneHotEncoder()
)),
('names', make_pipeline(
SingleColumnSelector(1),
PipelineAwareLabelEncoder(),
OneHotEncoder()
))
]))
])
Finally run your full dataframe through the Pipeline - it will one hot encode each column separately and concatenate at the end.
df = pd.DataFrame([["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]], columns=['Country', 'Name'])
X = df.values
transformed_X = pipeline.fit_transform(X)
print(transformed_X.toarray())
Which returns (first 3 columns are the countries, second 4 are the names)
[[ 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 1. 0. 0.]]
Upvotes: 3
Reputation: 3535
Below implementation should work well. Note that the input of onehotencoder
fit_transform
must not be 1-rank array and also output is sparse and we have used to_array()
to expand it.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
le = LabelEncoder()
X_num = le.fit_transform(X[:,0]).reshape(-1,1)
ohe = OneHotEncoder()
X_num = ohe.fit_transform(X_num)
print (X_num.toarray())
X[:,0] = X_num
print (X)
Upvotes: 4