Reputation: 432
I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand, of course, that I need to encode it.
What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the integer value it got during encoding interpreted as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.
Especially with the first one!
Upvotes: 15
Views: 30192
Reputation: 4519
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0
1 0 1 0
1 0 0 1]
Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a matrix that is less than full column rank: the first column is the sum of the last three.
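Since the question is about sklearn, here is the same idea in Python/pandas: a minimal sketch in which the DataFrame, the 'weight' column, and the 'animal' column are made-up placeholders for illustration.
import pandas as pd

# Toy data: one numeric feature plus a categorical 'animal' column (hypothetical names)
df = pd.DataFrame({'weight': [0.02, 4.0, 12.0],
                   'animal': ['mouse', 'cat', 'dog']})

# One indicator column per category; drop_first=True removes one category
# so the indicators plus a constant term stay full column rank
indicators = pd.get_dummies(df['animal'], drop_first=True)
X_with_indicator_vars = pd.concat([df[['weight']], indicators], axis=1)
print(X_with_indicator_vars)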
Upvotes: 5
Reputation: 554
Suppose the type of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:
import pandas as pd
catColumns = df.select_dtypes(['object']).columns
Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if (n > 2):
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        le.fit(df[col])
        df[col] = le.transform(df[col])
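After the loop, every column of df is numeric, so the frame can be passed directly to LogisticRegression. A minimal sketch, assuming df also contains a target column named 'y' (the name is just a placeholder):
from sklearn.linear_model import LogisticRegression

# 'y' is an assumed target column name; every other column is now numeric
X = df.drop('y', axis=1)
y = df['y']

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_)  # one coefficient per indicator/encoded column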
Upvotes: 2
Reputation: 9390
It's completely different classes:
[DictVectorizer][2].vocabulary_
A dictionary mapping feature names to feature indices.
I.e., after fit(), DictVectorizer has all possible feature names, and it now knows in which particular column it will place a particular value of a feature. So DictVectorizer.vocabulary_ contains indices of features, but not values.
LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
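A small example to make the difference concrete (a sketch; the feature name 'animal' is made up):
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# DictVectorizer: string-valued features become one column per (name, value) pair
dv = DictVectorizer(sparse=False)
X = dv.fit_transform([{'animal': 'cat'}, {'animal': 'dog'}, {'animal': 'mouse'}])
print(dv.vocabulary_)  # e.g. {'animal=cat': 0, 'animal=dog': 1, 'animal=mouse': 2}
print(X)               # the 0/1 values live in X, not in vocabulary_

# LabelEncoder: maps each label to a single integer and returns a 1D array
le = LabelEncoder()
print(le.fit_transform(['cat', 'dog', 'mouse']))  # [0 1 2]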
Upvotes: 3