Reputation: 77
I'm trying to turn a categorical string column into several binary dummy-variable columns, but I'm getting a ValueError.
Here's the code:
import sys, os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from dateutil import parser
import math
import traceback
import logging
datasetMod = pd.read_csv('data.csv')
X = datasetMod.iloc[:, 3:6].values
y = datasetMod.iloc[:, 1].values
print(X[:, 0])
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
try:
    labelencoder_X = LabelEncoder()
    X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
    onehotencoder = OneHotEncoder(categorical_features = [0])
    X = onehotencoder.fit_transform(X).toarray()
except Exception as e:
    exc_type, exc_obj, exc_tb = sys.exc_info()
    fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
    print(exc_type, fname, exc_tb.tb_lineno)
Here's the error:
<class 'ValueError'> multipleLinearRegression.py 23
The print statement for that column outputs:
['Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Weekend' 'Workday' 'Workday' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday' 'Workday'
'Workday' 'Workday' 'Workday' 'Workday' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend' 'Weekend'
'Weekend' 'Weekend' 'Weekend' 'Weekend']
There doesn't seem to be anything wrong with the strings themselves: no whitespace in between, no numeric-like notation. So I don't understand why I'm getting a ValueError saying it can't convert a string to float.
Any help would be highly appreciated.
Update
The OneHotEncoder mostly works now, but the final result has dtype object when it is supposed to be float64:
labelencoder_X = LabelEncoder()
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [1,2,3])
onehotencoder.fit(X[:, 1])
onehotencoder.fit(X[:, 2])
onehotencoder.fit(X[:, 3])
onehotencoder.transform(X[:, 1])
onehotencoder.transform(X[:, 2])
onehotencoder.transform(X[:, 3])
X = onehotencoder.toArray()
Update 2
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [1,2,3])
X[:, 1] = onehotencoder.fit_transform(X[:, 1]).toarray()
X[:, 2] = onehotencoder.fit_transform(X[:, 2]).toarray()
X[:, 3] = onehotencoder.fit_transform(X[:, 3]).toarray()
print(X.dtype) #object
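A note on the object dtype printed above: assigning encoded values back into slices of an existing object array never changes the array's dtype, so the result stays object no matter what is written into it. A minimal NumPy-only sketch of that behaviour, using a made-up two-row array rather than the real data:

```python
import numpy as np

# Hypothetical stand-in for X: mixed strings and numbers force dtype=object.
X = np.array([["Workday", 1.0], ["Weekend", 2.0]], dtype=object)

# Overwrite the string column in place with integer codes.
X[:, 0] = [1, 0]
print(X.dtype)  # still object: in-place assignment keeps the array's dtype

# An explicit cast produces a new float64 array once every cell is numeric.
X_float = X.astype(np.float64)
print(X_float.dtype)
```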
Final Code
Since categorical_features already dictates the column indexes, I can call fit_transform() on the whole matrix X. Thanks to @mkos for the patience!
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [1,2,3])
X = onehotencoder.fit_transform(X)
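For readers on newer scikit-learn: the categorical_features argument was deprecated in 0.20 and removed in 0.22, so the code above only runs on older versions. A rough equivalent, assuming ColumnTransformer is available (0.20+), might look like the sketch below. The sample rows and the column meanings are made up for illustration; only the column indexes [1, 2, 3] come from the code above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: one numeric column, then three string columns.
X = np.array([
    [1.5, "Workday", "Morning", "Rain"],
    [2.0, "Weekend", "Evening", "Sun"],
    [0.5, "Workday", "Evening", "Sun"],
    [3.0, "Weekend", "Morning", "Rain"],
], dtype=object)

# One-hot encode columns 1-3 directly from strings (no LabelEncoder step)
# and pass the remaining column through unchanged.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [1, 2, 3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# The passthrough column keeps the stacked result as dtype object,
# so cast explicitly to get float64.
X_encoded = ct.fit_transform(X).astype(np.float64)
print(X_encoded.dtype, X_encoded.shape)
```

The encoded columns come first in the output and the passed-through column is appended at the end, so column positions differ from the original X.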
Upvotes: 1
Views: 1496
Reputation: 428
This should do the trick:
onehotencoder = OneHotEncoder(categorical_features = [1,2,3])
X = onehotencoder.fit_transform(X)
You can print it with:
print(X.toarray())
Having X as a sparse matrix is not a problem, since it saves memory. If you want to inspect it, convert it to a regular np.array with toarray().
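To illustrate the sparse-versus-dense point, here is a small sketch with a made-up single-column input (not the asker's data). OneHotEncoder returns a SciPy sparse matrix by default, and toarray() (all lowercase) turns it into a regular NumPy array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column, shaped (n_samples, 1).
day_type = np.array([["Workday"], ["Weekend"], ["Workday"]], dtype=object)

encoded = OneHotEncoder().fit_transform(day_type)  # sparse by default
dense = encoded.toarray()                          # regular np.ndarray

print(type(encoded).__name__)
print(dense)
```

Categories are sorted alphabetically, so the first output column corresponds to "Weekend" and the second to "Workday".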
Upvotes: 2