Reputation: 3223
I am trying to train a model that reads its training data from a CSV file. To do this, I am applying one-hot encoding to the categorical features and then passing the resulting arrays of 1s and 0s in as features, along with the plain numerical features.
I have the following code:
import csv
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing
from sklearn.externals import joblib

X = pd.read_csv('Data2Cut.csv')
Y = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
Y_2 = Y.apply(le.fit_transform)
enc = preprocessing.OneHotEncoder()
enc.fit(Y_2)
onehotlabels = enc.transform(Y_2).toarray()
onehotlabels.shape
features = []
labels = []
mycsv = csv.reader(open('Data2Cut.csv'))
indexCount = 0
for row in mycsv:
    if indexCount < 8426:
        features.append([onehotlabels[indexCount], row[1], row[2], row[3], row[6], row[8], row[9], row[10], row[11]])
        labels.append(row[12])
        indexCount = indexCount + 1
training_data = np.array(features, dtype = 'float_')
training_labels = np.array(labels, dtype = 'float_')
log = linear_model.LogisticRegression()
log = log.fit(training_data, training_labels)
joblib.dump(log, "modelLogisticRegression.pkl")
It seems to get as far as the line:
training_data = np.array(features, dtype = 'float_')
before it crashes, giving the following error:
ValueError: setting an array element with a sequence.
I figure this is a result of the one-hot encoded values being arrays rather than floats. How can I change this code so that it handles both the categorical and numerical features as training data?
Edit: an example of a row I am feeding in, where each column is a feature, is:
mobile, 1498885897, 17491407, 23911, west coast, 2, seagull, 18, 41.0666666667, [0.325, 0.35], [u'text', u'font', u'writing', u'line'], 102, 5
Upvotes: 0
Views: 1166
Reputation: 1454
You must have already found your answer, but I am posting my findings here (I was struggling with the same problem) for people who have the same question. The way to achieve this is to append the columns of the resulting encoded sparse matrix to your training dataframe, instead of passing the encoded rows in as nested arrays.
This is of course a practical solution if you do not have too many unique values in your categories. You could look into more advanced encoding methods such as Backward Difference Coding or Polynomial Coding for cases where your categorical features can take many different values.
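A minimal sketch of that approach, using a hypothetical toy frame in place of the question's Data2Cut.csv (column names here are invented for illustration):

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for Data2Cut.csv
df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile"],
    "region": ["west coast", "east coast", "west coast"],
    "price": [23911.0, 18500.0, 20100.0],
})

# Integer-encode the string columns, then one-hot encode them
cat = df.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
encoded_int = cat.apply(le.fit_transform)

enc = preprocessing.OneHotEncoder()
onehot = enc.fit_transform(encoded_int).toarray()

# Append the encoded columns to the numeric part of the frame,
# replacing the original string columns
onehot_df = pd.DataFrame(onehot, index=df.index,
                         columns=["cat_%d" % i for i in range(onehot.shape[1])])
training = pd.concat([df.drop(columns=["device", "region"]), onehot_df], axis=1)

# Now every cell is a scalar, so this conversion no longer fails
training_data = training.values.astype(float)
```

Because each encoded category becomes its own column, every element of `training_data` is a plain float, which avoids the "setting an array element with a sequence" error.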
Upvotes: 1
Reputation: 2821
Which version of sklearn are you using?
I see that in sklearn version 0.18.1, passing 1-d arrays as data is deprecated; it gives the warning below and does not produce the desired result.
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
Try replacing the following line of code
onehotlabels = enc.transform(Y_2).toarray()
with the one below:
onehotlabels = enc.transform(Y_2.values.reshape(-1, 1)).toarray()
Alternatively, you may use pd.get_dummies to get the one-hot encoded feature matrix.
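For example (with a made-up frame standing in for the question's data), get_dummies expands each string column into 0/1 indicator columns in one call, leaving numeric columns untouched:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({"device": ["mobile", "desktop", "mobile"],
                   "hour": [18, 7, 23]})

# Each distinct value of "device" becomes its own indicator column
features = pd.get_dummies(df)
print(features.columns.tolist())
# ['hour', 'device_desktop', 'device_mobile']
```

This sidesteps the LabelEncoder/OneHotEncoder pair entirely, since get_dummies works directly on string columns.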
Upvotes: 0