Lin Ma

Reputation: 10139

logistic regression feature value normalization in scikit-learn

I am using Python 2.7, and my question is about the fit method of LogisticRegression. If the features passed in parameter X include non-numeric values (e.g. string features such as Male, Female), do I need to convert them into numeric features, or is that only recommended (for performance or other reasons)? And what should I do with multi-valued string features (e.g. a feature geo that could take any of the values San Francisco, San Jose, Mountain View, etc.)?

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit

regards, Lin

Upvotes: 0

Views: 1063

Answers (2)

Barak Krakauer

Reputation: 21

Just to add a bit to MhFarahani's answer: Yes, you need to convert those labels to numerical values (generally 0 or 1). For things like gender, you would want a column that has 0 for male and 1 for female, or vice versa. For something like geographical location, it's a bit more complicated. If there's a reasonable number of possible values, you could use the get_dummies function in pandas (see the pandas documentation) to automatically populate your dataframe with one 0/1 column per possible location; you could then drop one of those columns to make that location the 'default'.
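As a rough sketch of the approach described above, assuming a small made-up dataframe (the column names `gender` and `geo` and the sample values are only for illustration), the binary feature can be mapped to 0/1 by hand and the multi-valued one expanded with `get_dummies`. Passing `drop_first=True` drops the first category in sorted order, which then acts as the 'default' location.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "geo": ["San Francisco", "San Jose", "Mountain View", "San Jose"],
})

# Binary feature: a single 0/1 column is enough (0 = Male, 1 = Female here).
df["gender"] = (df["gender"] == "Female").astype(int)

# Multi-valued feature: one indicator column per location.
# drop_first=True removes the first category in sorted order
# ("Mountain View"), which becomes the implicit default.
encoded = pd.get_dummies(df, columns=["geo"], drop_first=True)

print(encoded)
```

A row with zeros in every remaining `geo_` column then stands for the dropped location.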

Upvotes: 2

MhFarahani

Reputation: 970

You must encode categorical features and convert them to numerical values if you want to use sklearn. This applies to all sklearn estimators (including LogisticRegression), and it does not matter which version of Python you are using.

See section 4.3.4, "Encoding categorical features", at http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features for more information.
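For illustration, here is one way the encoding could be done inside scikit-learn itself. This is only a sketch, assuming scikit-learn 0.20 or later (for `ColumnTransformer`) and a made-up dataset; the column names, the extra numeric `age` column, and the labels are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two string features and one numeric feature.
X = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "geo": ["San Francisco", "San Jose", "Mountain View",
            "San Jose", "San Francisco", "Mountain View"],
    "age": [34, 28, 45, 52, 23, 39],
})
y = [0, 1, 1, 0, 1, 0]  # made-up binary labels

# One-hot encode the string columns; pass the numeric column through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "geo"])],
    remainder="passthrough",
)

# Chaining the encoder and the classifier means fit() and predict()
# both accept the raw string-valued dataframe.
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X, y)

pred = model.predict(X.iloc[:2])
print(pred)
```

Keeping the encoder in the pipeline means the same category-to-column mapping learned during fit is reused at prediction time.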

Upvotes: 1
