user3104352

Reputation: 1130

Could not convert categorical data to numbers with OneHotEncoder

I have a small data set with categorical columns that I want to convert into a one-hot encoding in Python:

a,1,p
b,3,r
a,5,t

I tried to convert it with scikit-learn's OneHotEncoder:

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)

This code does not work and throws the following error:

ValueError: could not convert string to float: 't'

Can you please help me?

Upvotes: 1

Views: 755

Answers (2)

Bahman Engheta

Reputation: 116

@user3104352,

I encountered the same behavior and found it frustrating.

Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() function in /sklearn/preprocessing/data.py, and the very first line of that function is:

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
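
As a possible workaround, here is a minimal sketch that passes only the string columns to the encoder and re-attaches the numeric column afterwards. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly. The inline `raw` string is a stand-in for the file in the question, and the names `encoded`, `numeric`, and `result` are only illustrative.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Inline copy of the three sample rows from the question (stands in for the file).
raw = "a,1,p\nb,3,r\na,5,t\n"
data = pd.read_csv(io.StringIO(raw), sep=",", header=None)

# Encode only the two string columns (positions 0 and 2).
# Assumption: scikit-learn >= 0.20, where string categories are supported.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[[0, 2]].to_numpy()).toarray()

# Re-attach the numeric column (position 1) after the one-hot columns.
numeric = data[[1]].to_numpy(dtype=float)
result = np.hstack([encoded, numeric])

print(encoder.categories_)
print(result)
```

The one-hot columns come first (two for column 0, three for column 2), followed by the untouched numeric column.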

Upvotes: 1

ankit srivastava

Reputation: 116

Try this:

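import pandas as pd
from sklearn import preprocessing

# df is the DataFrame loaded from the file (named `data` in the question)
for c in df.columns:
    # Cast every value to str so each column is encoded consistently
    df[c] = df[c].astype(str)
    # Map each distinct string to an integer code
    le = preprocessing.LabelEncoder().fit(df[c])
    # Store the codes as floats; the builtin float replaces the removed np.float alias
    df[c] = le.transform(df[c]).astype(float)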
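
To illustrate what the loop produces, here is a small sketch of LabelEncoder on the first column of the sample data. The names `column`, `codes`, and `mapping` are only illustrative. Note that the result is one integer code per category, not a one-hot encoding, so a one-hot step may still be needed afterwards if the model should not treat the codes as ordered.

```python
from sklearn.preprocessing import LabelEncoder

# First column of the sample data from the question.
column = ["a", "b", "a"]

encoder = LabelEncoder()
codes = encoder.fit_transform(column)

# Each distinct string gets one integer code; classes_ is sorted.
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))

print(codes)
print(mapping)
```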

Upvotes: 1
