Reputation: 1130
I have some simple categorical data that I want to convert into a one-hot encoding in Python:
a,1,p
b,3,r
a,5,t
I tried to convert it with scikit-learn's OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This code does not work and throws an error:
ValueError: could not convert string to float: 't'
Can you please help me?
Upvotes: 1
Views: 755
Reputation: 116
@user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
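One way to sidestep the float check entirely is to do the encoding in pandas rather than in OneHotEncoder. The following is only a minimal sketch of that alternative, not the scikit-learn approach from the question: it uses pandas.get_dummies, and it reads the three sample rows from an in-memory string (an assumption made so the snippet runs without C:\test.txt).

```python
import io

import pandas as pd

# Sample rows from the question, read from a string instead of C:\test.txt
# (an assumption so the example is self-contained)
raw = "a,1,p\nb,3,r\na,5,t\n"
data = pd.read_csv(io.StringIO(raw), sep=",", header=None)

# One-hot encode only the string columns 0 and 2; column 1 stays numeric
encoded = pd.get_dummies(data, columns=[0, 2], dtype=float)
print(encoded)
```

The result keeps the numeric column and adds one indicator column per distinct value in columns 0 and 2, so the three sample rows end up with six columns.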
Upvotes: 1
Reputation: 116
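Try this, which label-encodes every column as integer codes and then stores them as floats (df is your DataFrame):
import numpy as np
from sklearn import preprocessing
for c in df.columns:
    # LabelEncoder needs uniform types, so convert every value to a string first
    df[c] = df[c].astype(str)
    # Map each distinct string to an integer code and store the result as floats
    df[c] = preprocessing.LabelEncoder().fit_transform(df[c]).astype(np.float64)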
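If you are on scikit-learn 0.20 or newer, the label-encoding step should not be needed, because OneHotEncoder accepts string columns directly and the column selection moves to ColumnTransformer. Here is a rough sketch under that version assumption; the DataFrame is built inline from the question's three rows, and the step name "onehot" is just an illustrative label.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The three sample rows from the question, built inline (assumption)
data = pd.DataFrame([["a", 1, "p"], ["b", 3, "r"], ["a", 5, "t"]])

# One-hot encode columns 0 and 2 by position and pass column 1 through unchanged
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0, 2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = np.asarray(transformer.fit_transform(data), dtype=float)
print(encoded)
```

The encoded indicator columns come first and the passed-through numeric column comes last.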
Upvotes: 1